DATA 624—Project No. 2

Ben Horvath

May 10, 2020

Executive Summary

How do production processes affect the pH levels in our soda brands? What is the best way to increase or decrease pH?

This analysis seeks to answer these questions. Production data are subjected to statistical learning models and evaluated. The results indicate that the following manipulations can increase pH:

  • mnf_flow < -50
  • usage_cont < 20
  • oxygen_filler < 0.1
  • pressure_vacuum near -5
  • filler_level > 115
  • carb_rel > 5.60

Similarly, we can minimize pH by keeping:

  • mnf_flow > -50
  • usage_cont > 24
  • oxygen_filler > 0.2
  • pressure_vacuum less than -6 or greater than -4
  • filler_level near 80
  • carb_rel near 5

Introduction

This analysis begins with the usual first steps: examining the data for missing observations, exploring the data, and clustering variables. The data are split into a train (80 percent) and a test (20 percent) set. Models are trained on the train set, using 100 repeats of 10-fold cross-validation to provide an unbiased estimate of each model's accuracy.

I develop three classes of models:

  1. Dummy models that serve as benchmarks
  2. Models that accept all available information and algorithmically choose the optimal variable set
  3. Models with manually reduced variable sets

The models are ultimately evaluated by their performance on the test set in terms of root mean squared error (\(RMSE\)).
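
For reference, \(RMSE\) is the square root of the average squared prediction error:

\[
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]

where \(y_i\) is the observed pH and \(\hat{y}_i\) is the model's prediction.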

Ultimately, a random forest model provides the best performance.

The Data

The first step is to split all the data into train and test sets. The latter will not be considered in any analysis until the very end, where it provides an accurate assessment of model performance. I remove the full dataset df immediately after the split. There are also a handful of observations where the dependent variable ph is not recorded; these are removed:

df <- readxl::read_excel('./data/StudentData.xlsx')
colnames(df) <- sapply(colnames(df), function(x) tolower(str_replace_all(x, ' ', '_')))
df <- df %>% tidyr::drop_na(ph)

set.seed(1804)
train_ix <- createDataPartition(df$ph, p=0.8, list = FALSE)

train <- df %>% slice(train_ix)
test <- df %>% slice(-train_ix)
rm(df)

There are some missing values in the training data, though none too severe. The variable with the most missing observations, mfr, is missing only about 8 percent of its values.

colSums(is.na(train)) %>% sort
##          mnf_flow           density           balling   pressure_vacuum 
##                 0                 0                 0                 0 
##                ph     air_pressurer     bowl_setpoint       balling_lvl 
##                 0                 0                 1                 1 
##         carb_flow        usage_cont          alch_rel       carb_volume 
##                 2                 5                 6                 7 
##       temperature          carb_rel     hyd_pressure1     oxygen_filler 
##                 8                 8                 9                 9 
## pressure_setpoint      filler_level     hyd_pressure2     hyd_pressure3 
##                10                12                13                13 
##     fill_pressure         carb_temp     carb_pressure          psc_fill 
##                14                17                19                19 
##    carb_pressure1     hyd_pressure4               psc       fill_ounces 
##                23                23                28                31 
##         pc_volume           psc_co2      filler_speed        brand_code 
##                33                33                40               106 
##               mfr 
##               157

To handle these, I use the newer package missRanger. Built on a fast implementation of the random forest algorithm, it imputes missing values with predictions based on the remaining variables. It iterates over all variables until the out-of-bag prediction error stops improving. Note that I do not use the dependent variable ph for the purposes of imputation:

train_imp <- missRanger(train, . ~ . - ph, pmm.k=3, num.trees=10,
                        verbose=FALSE, seed=1804)

I compared the summary statistics of the original train data set with the imputed data frame and found them satisfactory.
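
A quick spot check of that comparison, shown here for two of the most heavily missing variables (the variables chosen are illustrative):

summary(train$mfr)            # original, with NAs
summary(train_imp$mfr)        # after imputation
summary(train$filler_speed)
summary(train_imp$filler_speed)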

Data Exploration

The dependent variable ph is approximately normal. It is near-symmetric, with a mean of 8.55 and a median of 8.54. Almost all observations are between 8 and 9.
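
A minimal sketch of a histogram to verify this (the bin count is an assumption):

ggplot(train_imp, aes(x = ph)) +
  geom_histogram(bins = 40) +
  theme_bw()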

There are two observations with large outliers. One has an isolated ph value more than three standard deviations above the mean. The other has an isolated carb_rel value, over four standard deviations above the mean and a full standard deviation larger than the next closest observation. These two outlying observations are removed.

ph_lim <- mean(train_imp$ph) + (3 * sd(train_imp$ph))
carb_rel_lim <- mean(train_imp$carb_rel) + (4 * sd(train_imp$carb_rel))

train_imp <- train_imp %>%
  filter(ph <= ph_lim,
         carb_rel <= carb_rel_lim)

The approximately normal distribution of ph hides some complexity, however. Each of the brand codes has a different shape, although most of their central tendencies are close to the overall center. Several of the distributions have two peaks, suggesting there may be two production mechanisms, or possibly complex interactions across several variables.
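
A hedged sketch of the per-brand view described here (the geom and transparency are assumptions):

ggplot(train_imp, aes(x = ph, fill = brand_code)) +
  geom_density(alpha = 0.4) +
  theme_bw()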

Check how other variables vary across brand code:

train_imp %>%
  tidyr::gather(-brand_code, key = "var", value = "value") %>% 
  ggplot(aes(x = brand_code, y = value, color=brand_code)) +
    geom_boxplot() +
    facet_wrap(~ var, scales = "free") +
    theme_bw()

A pattern emerges: Brands A and D have similar distributions, and B and C have similar distributions. This is most visible with alch_rel, balling, balling_lvl, and density. This suggests there may be interaction effects between these variables and brand code.

Next, examine how the brand codes vary with ph:

train_imp %>%
  tidyr::gather(-ph, -brand_code, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = ph, color=factor(brand_code))) +
    geom_point(alpha=0.33) +
    geom_smooth() +
    facet_wrap(~ var, scales = "free") +
    theme_bw()

The brand codes appear to relate to the independent variables and to ph in similar ways, though the intercepts of the loess lines differ. Most of the four potential interactions identified above do not appear strong in this plot. However, I retain density and carb_rel as interaction candidates.

Although we have plenty of data, it could still be helpful to remove redundant variables. I use Hmisc::redun() to identify variables that can be explained by the other variables with \(R^2 > .90\):

rd <- redun( ~ air_pressurer + alch_rel + balling + balling_lvl + carb_flow +
               carb_pressure + carb_pressure1 + carb_rel + carb_temp +
               carb_volume + density + fill_ounces + fill_pressure +
               filler_level + filler_speed + hyd_pressure1 + hyd_pressure2 +
               hyd_pressure3 + hyd_pressure4 + mfr + mnf_flow + oxygen_filler +
               pc_volume + pressure_vacuum + psc + psc_co2 + psc_fill +
               temperature + usage_cont, train_imp, r2=0.9, pr=TRUE)
## 2053 observations used in analysis
## Step 1 of a maximum of 29 
## Step 2 of a maximum of 29 
## Step 3 of a maximum of 29 
## Step 4 of a maximum of 29 
## Step 5 of a maximum of 29 
## Step 6 of a maximum of 29 
## Step 7 of a maximum of 29 
print(rd)
## 
## Redundancy Analysis
## 
## redun(formula = ~air_pressurer + alch_rel + balling + balling_lvl + 
##     carb_flow + carb_pressure + carb_pressure1 + carb_rel + carb_temp + 
##     carb_volume + density + fill_ounces + fill_pressure + filler_level + 
##     filler_speed + hyd_pressure1 + hyd_pressure2 + hyd_pressure3 + 
##     hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume + 
##     pressure_vacuum + psc + psc_co2 + psc_fill + temperature + 
##     usage_cont, data = train_imp, r2 = 0.9, pr = TRUE)
## 
## n: 2053  p: 29   nk: 3 
## 
## Number of NAs:    0 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.9    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##   air_pressurer        alch_rel         balling     balling_lvl 
##           0.232           0.919           0.988           0.986 
##       carb_flow   carb_pressure  carb_pressure1        carb_rel 
##           0.708           0.935           0.391           0.826 
##       carb_temp     carb_volume         density     fill_ounces 
##           0.923           0.875           0.952           0.199 
##   fill_pressure    filler_level    filler_speed   hyd_pressure1 
##           0.435           0.644           0.910           0.768 
##   hyd_pressure2   hyd_pressure3   hyd_pressure4             mfr 
##           0.941           0.948           0.620           0.864 
##        mnf_flow   oxygen_filler       pc_volume pressure_vacuum 
##           0.828           0.370           0.415           0.671 
##             psc         psc_co2        psc_fill     temperature 
##           0.145           0.076           0.144           0.340 
##      usage_cont 
##           0.431 
## 
## Rendundant variables:
## 
## balling balling_lvl hyd_pressure3 carb_pressure filler_speed alch_rel
## 
## Predicted from variables:
## 
## air_pressurer carb_flow carb_pressure1 carb_rel carb_temp carb_volume density fill_ounces fill_pressure filler_level hyd_pressure1 hyd_pressure2 hyd_pressure4 mfr mnf_flow oxygen_filler pc_volume pressure_vacuum psc psc_co2 psc_fill temperature usage_cont 
## 
##   Variable Deleted   R^2      R^2 after later deletions
## 1          balling 0.988  0.969 0.967 0.967 0.967 0.955
## 2      balling_lvl 0.963        0.962 0.962 0.962 0.945
## 3    hyd_pressure3 0.945              0.944 0.943 0.943
## 4    carb_pressure 0.935                    0.935 0.935
## 5     filler_speed 0.908                          0.907
## 6         alch_rel 0.901

This analysis suggests we can drop the following variables: balling, balling_lvl, hyd_pressure3, carb_pressure, filler_speed, and alch_rel.

To get a better idea of the structure of the variables, we can use variable clustering:

var_tree <- hclustvar(train_imp %>%
                        select(-ph, -balling, -balling_lvl, -hyd_pressure3, 
                               -carb_pressure, -filler_speed, -alch_rel) %>%
                        select_if(is.numeric) %>%
                        as.matrix)
plot(var_tree)

Without background knowledge, it's hard to say what concept each cluster represents. But in principle, each cluster of similar variables represents a different aspect of the same underlying concept.

This clustering is used later in this report for variable reduction.

Modeling

The number of independent variables in this analysis is at the edge of what is plausible for manual analysis. A dedicated data scientist could conceivably spend a few weeks closely examining each variable's relationship with ph and its interrelationships with the other independent variables.

For the purposes of this project, I use data reduction, parameter shrinkage, and algorithmic variable selection methods (excluding stepwise variable selection). Data exploration and variable selection are thus more tightly integrated with the modeling process itself.

First, I develop dummy models with which we can compare more sophisticated models. If the more complex models fail to perform much better than these simpler models, it indicates something is wrong with the models or with our understanding.

Models will be judged by how well they minimize \(RMSE\) on the test set. I will also consider \(R^2\) and performance on the train set for diagnostic purposes.

We will use caret's cross-validation scheme with 10 folds and 100 repeats:

ctrl <- trainControl(method='repeatedcv', number=10, repeats=100)

Dummy models

\(M_0\): Predict the mean for each brand

This is a very simple model that predicts the brand code mean for each observation. The cross-validated \(R^2 = 0.1080\) and \(RMSE = 0.1615\). We should expect more sophisticated models to do substantially better than this one. If not, this simple model is to be preferred.

set.seed(1804)
m0 <- train(ph ~ brand_code, 
            data=train_imp, 
            method='lm', 
            trControl=ctrl,
            metric='RMSE')
m0$results
##   intercept      RMSE  Rsquared       MAE      RMSESD RsquaredSD
## 1      TRUE 0.1615239 0.1080345 0.1313097 0.006006724 0.03687632
##         MAESD
## 1 0.004516335

\(M_1\): Linear regression with all variables, interactions

The second-simplest possible model assumes all independent variables are important, linearly related to ph, and uncorrelated with each other. The redundant variables identified above are omitted, and interactions are included.

set.seed(1804)
m1 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 + carb_rel +
                 carb_temp + carb_volume + density + fill_ounces +
                 fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
                 hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
                 pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
                 usage_cont + density*brand_code + carb_rel*brand_code,
            data=train_imp, 
            method='lm', 
            trControl=ctrl,
            metric='RMSE')
m1$results
##   intercept     RMSE  Rsquared       MAE     RMSESD RsquaredSD       MAESD
## 1      TRUE 0.135467 0.3732627 0.1055916 0.00710571 0.05108998 0.005313304

This model performs fairly well, with \(R^2 = 0.3732\) and \(RMSE = 0.1354\). The summary shows that many variables are significant, including some of our interaction terms. The dummy variables for brand_code are not significant, however.

summary(m1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47607 -0.07980  0.00843  0.08649  0.51211 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.026e+01  1.133e+00   9.051  < 2e-16 ***
## air_pressurer          -1.558e-04  2.614e-03  -0.060  0.95249    
## carb_flow               1.152e-05  3.918e-06   2.941  0.00331 ** 
## carb_pressure1          5.757e-03  7.735e-04   7.443 1.45e-13 ***
## carb_rel                5.314e-02  1.190e-01   0.447  0.65524    
## carb_temp               8.460e-04  7.621e-04   1.110  0.26705    
## carb_volume            -7.089e-02  5.725e-02  -1.238  0.21580    
## density                -1.740e-02  5.865e-02  -0.297  0.76670    
## fill_ounces            -5.525e-02  3.636e-02  -1.519  0.12883    
## fill_pressure          -4.600e-05  1.096e-03  -0.042  0.96654    
## filler_level            1.726e-03  2.832e-04   6.094 1.31e-09 ***
## hyd_pressure1          -3.942e-05  4.123e-04  -0.096  0.92384    
## hyd_pressure2           1.136e-03  4.044e-04   2.810  0.00500 ** 
## hyd_pressure4          -1.428e-04  3.576e-04  -0.399  0.68963    
## mfr                    -1.636e-05  3.642e-05  -0.449  0.65341    
## mnf_flow               -6.632e-04  4.846e-05 -13.685  < 2e-16 ***
## oxygen_filler          -3.280e-01  8.014e-02  -4.093 4.42e-05 ***
## pc_volume              -4.667e-02  6.137e-02  -0.760  0.44712    
## pressure_vacuum        -1.055e-02  7.221e-03  -1.460  0.14433    
## psc                    -1.388e-01  6.512e-02  -2.132  0.03316 *  
## psc_co2                -9.738e-02  7.106e-02  -1.370  0.17072    
## psc_fill               -6.349e-03  2.611e-02  -0.243  0.80791    
## temperature            -1.759e-02  2.490e-03  -7.062 2.25e-12 ***
## usage_cont             -7.257e-03  1.293e-03  -5.612 2.27e-08 ***
## brand_codeB            -1.483e-01  7.222e-01  -0.205  0.83728    
## brand_codeC            -1.460e+00  8.905e-01  -1.639  0.10128    
## brand_codeD            -1.607e+00  8.303e-01  -1.935  0.05308 .  
## `density:brand_codeB`  -1.746e-01  6.761e-02  -2.583  0.00987 ** 
## `density:brand_codeC`   1.285e-01  8.031e-02   1.600  0.10970    
## `density:brand_codeD`  -1.781e-01  7.195e-02  -2.475  0.01339 *  
## `carb_rel:brand_codeB`  6.522e-02  1.333e-01   0.489  0.62462    
## `carb_rel:brand_codeC`  2.358e-01  1.664e-01   1.417  0.15659    
## `carb_rel:brand_codeD`  3.541e-01  1.535e-01   2.306  0.02120 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.134 on 2020 degrees of freedom
## Multiple R-squared:  0.3944, Adjusted R-squared:  0.3848 
## F-statistic: 41.12 on 32 and 2020 DF,  p-value: < 2.2e-16

I examine the residuals of this model in the hope that they will suggest corrections to the model. Below are the standard residual plots. For the most part, this model appears acceptable: the fitted-versus-residuals plot shows no relationship. The Q-Q plot, however, shows a worrying amount of deviation from the normal distribution.

I also used the car package to examine partial residuals. Most of them showed the desired lack of pattern. However, four variables are especially troublesome: for each of them, as the variable diverges from its mean, the absolute error grows. My guess is that this is due to complex interactions between the variables.
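
A minimal sketch of the standard diagnostics referenced above, drawn from the lm object inside the caret fit (the car partial-residual plots are not reproduced here):

# Standard lm diagnostics: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(m1$finalModel)
par(mfrow = c(1, 1))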

Algorithmic variable selection

\(M_2\): GLMnet

This model uses coefficient shrinkage for variable selection; the lasso and ridge regression are two examples. The glmnet package allows us to test both in one grid via the \(\alpha\) parameter (\(\alpha = 1\) gives the lasso, \(\alpha = 0\) ridge regression). Again, I omit the redundant variables and include interactions:

m2_grid <- data.frame(alpha=0:1, lambda=seq(0, .1, length=100))
  
m2 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
                 carb_rel + carb_temp + carb_volume + density + fill_ounces +
                 fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
                 hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
                 pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
                 usage_cont + density*brand_code + carb_rel*brand_code, 
            data=train_imp,
            method='glmnet',
            preProc = c('center', 'scale'),
            trControl=ctrl,
            tuneGrid=m2_grid,
            metric='RMSE')

m2$bestTune
##   alpha      lambda
## 4     0 0.006060606
plot(m2)

The best tune is a ridge regression (\(\alpha = 0\)) with regularization parameter \(\lambda = 0.0061\). This indicates little regularization is required, so the resulting model will be close to \(M_1\), ordinary regression with interactions. Arguably this could be expected, as the redundancy analysis already eliminated the variables ridge regression would shrink.

This model generates a cross-validated \(RMSE = 0.1385\) and \(R^2 = 0.3518\), slightly worse than, but essentially equal to, ordinary regression.
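
To see how gently the penalty was applied, the coefficients at the selected \(\lambda\) can be inspected from the underlying glmnet fit (a sketch, assuming m2 as trained above):

coef(m2$finalModel, s = m2$bestTune$lambda)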

\(M_3\): Partial Least Squares

Partial least squares (PLS) regression also performs variable reduction. It projects the independent variables \(X\) and the dependent variable \(Y\) into a numeric space optimized for prediction. Its mechanics are the same as principal components analysis, which can also be used for variable reduction.

m3_grid <- data.frame(.ncomp=2:26)

m3 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
                 carb_rel + carb_temp + carb_volume + density + fill_ounces +
                 fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
                 hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
                 pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
                 usage_cont + density*brand_code + carb_rel*brand_code,
            data=train_imp,
            method='pls',
            trControl=ctrl,
            tuneGrid=m3_grid,
            metric='RMSE')
print(m3$bestTune)
##    ncomp
## 20    21
plot(m3)

The optimization routine finds that 21 components best fit the data. This results in a cross-validated \(RMSE = 0.1353\) and \(R^2 = 0.3746\), just slightly better than linear regression.

\(M_4\): MARS

None of the previous variable reduction techniques have proven superior to linear regression. However, there is good a priori reason to believe that multivariate adaptive regression splines (MARS) may substantially improve on the previous methods. None of the previous algorithms are really designed to handle nonlinear relationships or substantial interactions between variables, which we know (or at least surmise) are present in this dataset. MARS, however, is designed for exactly this.

The grid search covers the degree of interactions under consideration and the number of terms retained in the final model. (Due to computational constraints, I use 10-fold cross-validation without repeats here.) Since we know there are interactions in the data, I do not consider models without them:

m4_grid <- expand.grid(degree=2:3, nprune=5:26)
m4 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
                 carb_rel + carb_temp + carb_volume + density + fill_ounces +
                 fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
                 hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
                 pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
                 usage_cont + brand_code,
            data=train_imp,
            method='earth',
            trControl=trainControl(method='cv', number=10),
            tuneGrid=m4_grid,
            metric='RMSE')
print(m4$bestTune)
##    nprune degree
## 41     23      3
plot(m4)

The best tune is \(nprune = 23\) and \(degree = 3\), with a cross-validated \(RMSE = 0.1219\) and \(R^2 = 0.4949\). Our intuition was correct: the MARS model's ability to capture complex interactions and nonlinearities produces superior performance.

From coef, we see where the algorithm has picked up on these complexities. For instance, the output suggests a ‘kink’ in mnf_flow near 0.2, which also interacts with pressure_vacuum.

head(coef(m4$finalModel))
##                             (Intercept) 
##                            8.3966151719 
##                         h(mnf_flow-0.2) 
##                            0.0006900883 
##                         h(0.2-mnf_flow) 
##                            0.0022994971 
## h(0.2-mnf_flow) * h(-5-pressure_vacuum) 
##                           -0.0015102289 
##           h(mnf_flow-0.2) * brand_codeD 
##                            0.0010688054 
##         h(146.6-mnf_flow) * brand_codeC 
##                           -0.0005251704
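
Each h(·) term above is a hinge function, \(h(x) = \max(0, x)\): h(mnf_flow - 0.2) is zero when mnf_flow is below 0.2 and grows linearly above it, while h(0.2 - mnf_flow) does the reverse. Products of hinge terms, such as h(0.2 - mnf_flow) * h(-5 - pressure_vacuum), are how MARS encodes interactions.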

\(M_5\): Random forest

The random forest algorithm is also known to handle interactions well. It grows many decision trees, often hundreds, and combines their predictions, reducing the variance of individual trees without increasing the bias of the model. (The fit below is cached: the commented code shows how the model was trained, and the saved object is loaded from disk to avoid refitting.)

# m5_grid <- expand.grid(mtry=8:26)
# m5 <- train(ph ~ air_pressurer + carb_flow + carb_pressure1 +
#                  carb_rel + carb_temp + carb_volume + density + fill_ounces +
#                  fill_pressure + filler_level + hyd_pressure1 + hyd_pressure2 +
#                  hyd_pressure4 + mfr + mnf_flow + oxygen_filler + pc_volume +
#                  pressure_vacuum + psc + psc_co2 + psc_fill + temperature +
#                  usage_cont + brand_code, 
#             data=train_imp,
#             method='rf',
#             trControl=trainControl(method='cv', number=10),
#             tuneGrid=m5_grid,
#             metric='RMSE')
#saveRDS(m5, 'm5.rds')
m5 <- readRDS('m5.rds')
plot(m5)

print(m5)
## Random Forest 
## 
## 2053 samples
##   24 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1848, 1847, 1848, 1847, 1848, 1848, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    8    0.10249342  0.6615939  0.07652869
##    9    0.10211819  0.6623477  0.07594921
##   10    0.10143424  0.6659025  0.07530823
##   11    0.10079834  0.6692978  0.07498000
##   12    0.10054612  0.6696493  0.07443819
##   13    0.10077385  0.6672601  0.07466721
##   14    0.10040009  0.6691800  0.07425781
##   15    0.09996865  0.6715474  0.07389853
##   16    0.09992275  0.6712791  0.07383834
##   17    0.09971171  0.6719506  0.07362730
##   18    0.09947903  0.6731124  0.07340238
##   19    0.09949184  0.6722239  0.07336658
##   20    0.09928817  0.6733215  0.07325301
##   21    0.09946314  0.6708454  0.07317070
##   22    0.09945228  0.6703219  0.07313559
##   23    0.09924319  0.6720486  0.07297231
##   24    0.09930293  0.6710184  0.07305435
##   25    0.09937296  0.6696641  0.07303984
##   26    0.09931272  0.6699625  0.07283086
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 23.

The best model has \(mtry = 23\), with a cross-validated \(RMSE = 0.0992\) and \(R^2 = 0.6720\). Both measures are substantially better than those of the previous models.

Variable clustering

The previous section focused on algorithmic variable selection, where the full (or near-full) variable set was fed to the algorithms, which were left to their own devices to determine which variables were important.

In this section, I manually reduce the variables based on variable clustering, and then apply statistical learning algorithms.

Recall the variable clustering tree from before:

I cut the tree into 7 clusters. For each cluster, I compute the Spearman rank correlation of its variables with ph and select the most strongly correlated variable as the cluster's 'representative' (a sketch of this step follows the correlation output below). This results in the following variable set:

  • carb_rel
  • temperature
  • psc_co2
  • pressure_vacuum
  • bowl_setpoint
  • mnf_flow
  • pc_volume

and brand_code.

##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##        0.06047926       -0.08173385        0.08718200        0.07054020 
##         carb_temp               psc          psc_fill           psc_co2 
##        0.02648804       -0.08654489       -0.02715838       -0.06877569 
##          mnf_flow    carb_pressure1     fill_pressure     hyd_pressure1 
##       -0.48094249       -0.10466227       -0.25257874       -0.11828725 
##     hyd_pressure2     hyd_pressure3     hyd_pressure4      filler_level 
##       -0.19610311       -0.23563152       -0.13371287        0.35631838 
##      filler_speed       temperature        usage_cont         carb_flow 
##        0.04804262       -0.19964251       -0.35677925        0.04608193 
##           density               mfr           balling   pressure_vacuum 
##       -0.01072597        0.03801699       -0.03507307        0.24883819 
##                ph     oxygen_filler     bowl_setpoint pressure_setpoint 
##        1.00000000        0.18585281        0.40494642       -0.30793401 
##     air_pressurer          alch_rel          carb_rel       balling_lvl 
##       -0.01209651        0.10068454        0.19043588        0.07626197
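
A sketch of that selection step, assuming cutreevar() from the ClustOfVar package behaves as documented (object names are illustrative):

# Cut the variable tree into 7 clusters; $cluster gives each variable's membership
clusters <- cutreevar(var_tree, 7)$cluster

# Spearman correlation of every clustered variable with ph
cors <- sapply(names(clusters),
               function(v) cor(train_imp[[v]], train_imp$ph, method = 'spearman'))

# Within each cluster, keep the variable with the largest absolute correlation
reps <- tapply(names(clusters), clusters,
               function(vars) vars[which.max(abs(cors[vars]))])
reps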

\(M_6\): Linear model with reduced variables

set.seed(1804)
m6 <- train(ph ~ brand_code*carb_rel + temperature + psc_co2 + pressure_vacuum +
                 bowl_setpoint + mnf_flow + pc_volume,
            data=train_imp, 
            method='lm', 
            trControl=ctrl,
            metric='RMSE')
m6$results
##   intercept      RMSE  Rsquared      MAE      RMSESD RsquaredSD
## 1      TRUE 0.1394137 0.3355636 0.110679 0.006599649 0.04752279
##         MAESD
## 1 0.005190041

We see that the reduced model performs only slightly worse than the full linear model \(M_1\) on both \(RMSE\) and \(R^2\). This indicates that a much smaller set of variables is actually informative.

\(M_7\): Generalized additive model with reduced variables

GAMs function like generalized linear models (GLMs), except that the estimated relationships need not be linear: each predictor can enter through a smooth function. This makes GAMs more flexible and better suited to modeling nonlinear relationships.

The caret package's support for GAMs is limited, so I fit this model with mgcv directly:

m7 <- mgcv::gam(ph ~  brand_code + brand_code:carb_rel + s(carb_rel, bs='ps') + s(temperature, bs='ps') +
                s(psc_co2, bs='ps') + s(pressure_vacuum, bs='ps') +
                s(bowl_setpoint, bs='ps') + s(mnf_flow, bs='ps') + s(pc_volume, bs='ps'), 
                data=train_imp, method="REML")
performance::rmse(m7)
## [1] 0.1308967
performance::r2(m7)
## $R2_Nagelkerke
## Nagelkerke's R2 
##       0.4159687

This results in a non-cross-validated \(RMSE = 0.1309\) and \(R^2 = 0.4160\). This is superior to both \(M_1\), the linear regression with the full variable set and interactions, and \(M_6\), the linear regression with the reduced variable set. However, it is nowhere near the random forest model. Arguably, this suggests that the difficulty in modeling these data is due to interactions rather than nonlinearities.

Evaluation

In this section, each of the eight models developed above is tested on the hold-out dataset. The highest-performing model is analyzed in the next section to determine the best way to manipulate pH.

First, we have to fill in missing data in the same way as above:

colSums(is.na(test))
##        brand_code       carb_volume       fill_ounces         pc_volume 
##                14                 3                 7                 6 
##     carb_pressure         carb_temp               psc          psc_fill 
##                 8                 9                 5                 4 
##           psc_co2          mnf_flow    carb_pressure1     fill_pressure 
##                 6                 0                 9                 4 
##     hyd_pressure1     hyd_pressure2     hyd_pressure3     hyd_pressure4 
##                 2                 2                 2                 5 
##      filler_level      filler_speed       temperature        usage_cont 
##                 4                14                 4                 0 
##         carb_flow           density               mfr           balling 
##                 0                 0                51                 0 
##   pressure_vacuum                ph     oxygen_filler     bowl_setpoint 
##                 0                 0                 2                 1 
## pressure_setpoint     air_pressurer          alch_rel          carb_rel 
##                 2                 0                 1                 0 
##       balling_lvl 
##                 0
test_imp <- missRanger(test, . ~ . - ph, pmm.k=3, num.trees=10, verbose=FALSE, seed=1804)

Now, calculate performance metrics on the imputed test set:
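
A sketch of how each row of the table below is computed, shown for the random forest; caret's postResample() returns RMSE, \(R^2\), and MAE for a vector of predictions:

rf_pred <- predict(m5, newdata = test_imp)
postResample(pred = rf_pred, obs = test_imp$ph)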

Model      Description                      Test \(RMSE\)   Test \(R^2\)
\(M_0\)    Dummy: predict brand's mean      0.1641          0.1300
\(M_1\)    LR, all IVs + interactions       0.1401          0.3651
\(M_2\)    Ridge, all IVs + interactions    0.1400          0.3669
\(M_3\)    PLS, all IVs + interactions      0.1388          0.3774
\(M_4\)    MARS, all IVs + interactions     0.1229          0.5127
\(M_5\)    Random forest, all IVs           0.0990          0.6908
\(M_6\)    LR, reduced IVs                  0.1445          0.3250
\(M_7\)    GAM, reduced IVs                 0.1363          0.3996

Thus the random forest model performs best, substantially better than the next closest model, MARS. Other interesting observations:

  1. For the most part, the performance on the test set is very close to the cross-validated performance on the training set

  2. There is little difference between the ordinary regression model and the more complicated regressions (PLS, ridge, GAM). Only when the algorithm introduces complex interactions (MARS) do we see improved performance.

  3. The reduced-variable LR performs nearly as well as the full-variable linear model while using only a handful of predictors. This suggests variable clustering successfully captured the key drivers of the data set.

Recommendations and Conclusion

This section analyzes the model to develop recommendations for manipulating the pH levels of soft drinks.

The plot below shows how important each independent variable is to the final outcome. The variable mnf_flow dominates the ranking. The dummy variable for brand C is also recorded as very important, which is an interesting result.
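
A sketch of the importance ranking, using caret's varImp() (the number of variables shown is an assumption):

plot(varImp(m5), top = 15)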

Let’s ‘zoom in’ on the most important variables to understand how changes in them affect pH:
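
A hedged sketch of these single-variable views, using partial dependence plots from the pdp package; the package choice and the variables shown are assumptions, and any partial-dependence tool would serve:

library(pdp)

# Average predicted pH over the training data as each variable is varied
partial(m5, pred.var = 'mnf_flow', train = train_imp, plot = TRUE)
partial(m5, pred.var = 'usage_cont', train = train_imp, plot = TRUE)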

Thus the model suggests we can increase pH by keeping:

  • mnf_flow < -50
  • usage_cont < 20
  • oxygen_filler < 0.1
  • pressure_vacuum near -5
  • filler_level > 115
  • carb_rel > 5.60

Similarly, we can minimize pH by keeping:

  • mnf_flow > -50
  • usage_cont > 24
  • oxygen_filler > 0.2
  • pressure_vacuum less than -6 or greater than -4
  • filler_level near 80
  • carb_rel near 5