Assignment

ABC Beverage is conducting an internal review of its manufacturing process with a focus on understanding the factors that influence product pH levels. This technical report highlights our team’s data science-driven approach to modeling and predicting pH using historical production data.

To achieve this, we analyzed the dataset provided by ABC Beverage, performed data cleaning and exploratory analysis, and evaluated several models, including linear and nonlinear regressions, decision trees, random forests, and more. Throughout this report we will outline our methodology, modeling process, evaluation strategy, and the final model selected for predicting pH levels.

Setup

library(corrplot)
library(caret)
library(earth)
library(e1071)
library(Formula)
library(GGally)
library(ggpubr)
library(glmnet)
library(glue)
library(janitor)
library(knitr)
library(Metrics)
library(plotmo)
library(plotrix)
library(pls)
library(RANN)
library(randomForest)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(tidyverse)

Load the Data

First we must load the data. We have been provided two datasets, and we presumed that one was designated for training (“StudentData.xlsx”) and the other for testing (“StudentEvaluation.xlsx”). However, we found that “StudentEvaluation.xlsx” has no target variable values. So, we concluded that “StudentData.xlsx” is for both training and testing, and that “StudentEvaluation.xlsx” is for our final predictions.

train_data <- read_excel("StudentData.xlsx")
test_data <- read_excel("StudentEvaluation.xlsx")

EDA

George Hagstrom, our professor of DATA 607, defined exploratory data analysis as “…the art of looking at data in a systematic way in order to understand the underlying structure of the data. It has two main goals: ensure data quality and uncover patterns to guide future analysis. EDA is detective work: you ask and answer questions, which inspires more questions.” And that is exactly what we will do. We are going to get to know our datasets by summarizing them, scrutinizing them, looking at their correlation properties, and visualizing them.

Summary Information

We will take a look at a sample of the data, examining our data’s columns and types, the number of rows and columns, and the first few rows. Finally, we will calculate some summary statistics: the mean, median, min, max, and standard deviation of each column, so that we may better understand the nature of our data.

head(train_data)
## # A tibble: 6 × 33
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(train_data)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
##  $ Brand Code       : chr [1:2571] "B" "A" "B" "A" ...
##  $ Carb Volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
##  $ PC Volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num [1:2571] 141 140 145 133 137 ...
##  $ PSC              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num [1:2571] 119 122 120 115 118 ...
##  $ Fill Pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num [1:2571] 121 119 120 118 119 ...
##  $ Filler Speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
##  $ Temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
##  $ Density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num [1:2571] 725 727 735 731 723 ...
##  $ Balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num [1:2571] 143 143 142 146 146 ...
##  $ Alch Rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

We can see there are 33 columns and 2,571 rows in total. The 33 columns include the “PH” column, which is what we will be aiming to predict; PH will be the response variable throughout our regression exploration.

We can see that we have mostly numeric values. The only character, or categorical, variable is Brand Code. Our team’s preferred guidance on predictive modeling, by Max Kuhn and Kjell Johnson, recommends either converting categorical variables into dummy variables or removing them if they are uninformative or have too many categories. Because Brand Code proves to have significant importance in some of our models, we will convert it to dummy variables before training (models such as Lasso require numeric input predictors).

We can also see that there are negative values, which means we won’t be able to apply the Box-Cox method when processing our data. Our guidance and standards, from Applied Predictive Modeling, state that if the data includes negative values we should use the Yeo-Johnson method instead.
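
As a minimal illustration (toy vector, not project data) of why Yeo-Johnson is the safer choice here:

# Box-Cox is defined only for strictly positive data; Yeo-Johnson extends
# the same idea to zero and negative values (e.g. Mnf Flow's -100 readings).
x <- c(-100, -50, 0, 25, 100)
pp <- preProcess(data.frame(x = x), method = "YeoJohnson")
predict(pp, data.frame(x = x))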

We will also take a look at the summary statistics of our data. We would normally use the summary() function, but it prints wide rather than long and does not present well. So, we are going to take a stick-shift (manual) approach.

df <- select(train_data, -`Brand Code`)
data.frame(
  mean = sapply(df, mean, na.rm = TRUE) |> round(3),
  median = sapply(df, median, na.rm = TRUE) |> round(3),
  min = sapply(df, min, na.rm = TRUE) |> round(3),
  max = sapply(df, max, na.rm = TRUE) |> round(3),
  sd = sapply(df, sd, na.rm = TRUE) |> round(3),
  sd.normal = sapply(df, function(x) {
    x = x[!is.na(x)]
    sd(x) / (max(x) - min(x))
  }) |> round(3)
) |>
  kable()
mean median min max sd sd.normal
Carb Volume 5.370 5.347 5.040 5.700 0.106 0.161
Fill Ounces 23.975 23.973 23.633 24.320 0.088 0.127
PC Volume 0.277 0.271 0.079 0.478 0.061 0.152
Carb Pressure 68.190 68.200 57.000 79.400 3.538 0.158
Carb Temp 141.095 140.800 128.600 154.000 4.037 0.159
PSC 0.085 0.076 0.002 0.270 0.049 0.184
PSC Fill 0.195 0.180 0.000 0.620 0.118 0.190
PSC CO2 0.056 0.040 0.000 0.240 0.043 0.179
Mnf Flow 24.569 65.200 -100.200 229.400 119.481 0.363
Carb Pressure1 122.586 123.200 105.600 140.200 4.743 0.137
Fill Pressure 47.922 46.400 34.600 60.400 3.178 0.123
Hyd Pressure1 12.438 11.400 -0.800 58.000 12.433 0.211
Hyd Pressure2 20.961 28.600 0.000 59.400 16.386 0.276
Hyd Pressure3 20.458 27.600 -1.200 50.000 15.976 0.312
Hyd Pressure4 96.289 96.000 52.000 142.000 13.123 0.146
Filler Level 109.252 118.400 55.800 161.200 15.698 0.149
Filler Speed 3687.199 3982.000 998.000 4030.000 770.820 0.254
Temperature 65.968 65.600 63.600 76.200 1.383 0.110
Usage cont 20.993 21.790 12.080 25.900 2.978 0.215
Carb Flow 2468.354 3028.000 26.000 5104.000 1073.696 0.211
Density 1.174 0.980 0.240 1.920 0.378 0.225
MFR 704.049 724.000 31.400 868.600 73.898 0.088
Balling 2.198 1.648 -0.170 4.012 0.931 0.223
Pressure Vacuum -5.216 -5.400 -6.600 -3.600 0.570 0.190
PH 8.546 8.540 7.880 9.360 0.173 0.117
Oxygen Filler 0.047 0.033 0.002 0.400 0.047 0.117
Bowl Setpoint 109.327 120.000 70.000 140.000 15.303 0.219
Pressure Setpoint 47.615 46.000 44.000 52.000 2.039 0.255
Air Pressurer 142.834 142.600 140.800 148.200 1.212 0.164
Alch Rel 6.897 6.560 5.280 8.620 0.505 0.151
Carb Rel 5.437 5.400 4.960 6.060 0.129 0.117
Balling Lvl 2.050 1.480 0.000 3.660 0.870 0.238

Looking at the normalized spread (sd.normal), we see that MFR’s is the lowest (0.088). On the other hand, Mnf Flow (0.363) and Hyd Pressure3 (0.312) show relatively high variability.
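
If this raises concerns about degenerate predictors, caret’s nearZeroVar() offers a quick check; a minimal sketch using the df built above:

# saveMetrics = TRUE returns per-column diagnostics (frequency ratio and
# percent of unique values) instead of just the offending column indices.
nearZeroVar(df, saveMetrics = TRUE) |>
  head()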

Missing Values (NA)

Let’s quantify how much data is missing from our data.

count_missing <- function(data) {
  data |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      everything(),
      names_to = "variable",
      values_to = "missing"
    ) |>
    filter(missing > 0) |>
    mutate(
      ratio = missing / nrow(data)
    ) |>
    arrange(desc(missing))
}

count_missing(train_data) |>
  kable()
variable missing ratio
MFR 212 0.0824582
Brand Code 120 0.0466744
Filler Speed 57 0.0221704
PC Volume 39 0.0151692
PSC CO2 39 0.0151692
Fill Ounces 38 0.0147802
PSC 33 0.0128355
Carb Pressure1 32 0.0124465
Hyd Pressure4 30 0.0116686
Carb Pressure 27 0.0105018
Carb Temp 26 0.0101128
PSC Fill 23 0.0089459
Fill Pressure 22 0.0085570
Filler Level 20 0.0077791
Hyd Pressure2 15 0.0058343
Hyd Pressure3 15 0.0058343
Temperature 14 0.0054454
Oxygen Filler 12 0.0046674
Pressure Setpoint 12 0.0046674
Hyd Pressure1 11 0.0042785
Carb Volume 10 0.0038895
Carb Rel 10 0.0038895
Alch Rel 9 0.0035006
Usage cont 5 0.0019448
PH 4 0.0015558
Mnf Flow 2 0.0007779
Carb Flow 2 0.0007779
Bowl Setpoint 2 0.0007779
Density 1 0.0003890
Balling 1 0.0003890
Balling Lvl 1 0.0003890
count_missing(test_data) |>
  kable()
variable missing ratio
PH 267 1.0000000
MFR 31 0.1161049
Filler Speed 10 0.0374532
Brand Code 8 0.0299625
Fill Ounces 6 0.0224719
PSC 5 0.0187266
PSC CO2 5 0.0187266
PC Volume 4 0.0149813
Carb Pressure1 4 0.0149813
Hyd Pressure4 4 0.0149813
PSC Fill 3 0.0112360
Oxygen Filler 3 0.0112360
Alch Rel 3 0.0112360
Fill Pressure 2 0.0074906
Filler Level 2 0.0074906
Temperature 2 0.0074906
Usage cont 2 0.0074906
Pressure Setpoint 2 0.0074906
Carb Rel 2 0.0074906
Carb Volume 1 0.0037453
Carb Temp 1 0.0037453
Hyd Pressure2 1 0.0037453
Hyd Pressure3 1 0.0037453
Density 1 0.0037453
Balling 1 0.0037453
Pressure Vacuum 1 0.0037453
Bowl Setpoint 1 0.0037453
Air Pressurer 1 0.0037453

As we can see, there are a fair number of missing values. PH’s four missing values may not be topping the charts; nonetheless, they may be the most problematic. The guidance for this scenario is to remove the rows with missing PH from the training data, since we cannot train on, or score against, a missing response. We will explain more when we reach that step, prior to training our models.

Distribution

We are temporarily recoding Brand Code as numeric so that we can pivot our dataset into long form. And since we already have a handle on missing values, we filter them out here.

train_data |>
  mutate(
    `Brand Code` = recode(
      `Brand Code`,
      "A" = 1,
      "B" = 2,
      "C" = 3,
      "D" = 4
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "values"
  ) |>
  filter(!is.na(values)) |>
  ggplot(aes(x = values)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, ncol = 4, scales = "free") +
  labs(
    title = "Distribution of Data",
    x = "Values",
    y = "Count"
  )

We see a good amount of data with normal distributions. But we also see some variables that stand out:

  • Balling, Balling Lvl and Density are bimodal. To a lesser extent, so is Air Pressurer.
  • Filler Speed is heavily left skewed, with a crop of suspicious values at the low end.
  • Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 all have a very large deposit of 0 values. Similarly, Mnf Flow has a large deposit of negative values.
  • Oxygen Filler is extremely right skewed.
  • Bowl Setpoint and Pressure Setpoint look as if they are composed of discrete values (see the quick check after this list).
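
As a quick check of that last observation, counting distinct values supports the discrete reading; a minimal sketch:

# Few distinct values across 2,571 rows suggests these are discrete setpoints.
n_distinct(train_data$`Bowl Setpoint`)
n_distinct(train_data$`Pressure Setpoint`)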

One Hot Encoding

Before we continue exploring, we are going to convert our categorical variable (Brand Code) to multiple variables using one hot encoding. We are doing it now because we are about to examine correlation and we want to include Brand Code in that study. We will store the results in train_data_pp, which will ultimately be our preprocessed training data. We will apply the same transformation to test_data so that we can use it for prediction later.

model <- dummyVars(~ ., data = train_data)
train_data_pp <- predict(model, newdata = train_data) |>
  as.data.frame()
test_data_pp <- predict(model, newdata = test_data) |>
  as.data.frame()

Correlation

We are using a correlation matrix to help us better understand the relationships between predictor variables. We are leaving the target (PH) in the dataset so that we may learn which predictors it correlates with, and so that we don’t remove predictors with which it has strong relationships.

cor_matrix <- cor(train_data_pp, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix, order = "hclust", tl.cex = 0.7)

And let’s identify those that we may want to remove because they are highly correlated. We will list variable pairs with correlation of 0.85 or higher.

cor_matrix |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  pivot_longer(
    -variable,
    names_to = "correlation",
    values_to = "value"
  ) |>
  mutate(
    value = round(value, 3),
    abs_value = abs(value)
  ) |>
  filter(
    variable != correlation & abs_value >= 0.85
  ) |>
  arrange(desc(abs_value)) |>
  kable()
variable correlation value abs_value
Balling Balling Lvl 0.978 0.978
Balling Lvl Balling 0.978 0.978
Density Balling 0.955 0.955
Balling Density 0.955 0.955
Filler Level Bowl Setpoint 0.949 0.949
Bowl Setpoint Filler Level 0.949 0.949
Density Balling Lvl 0.948 0.948
Balling Lvl Density 0.948 0.948
Filler Speed MFR 0.930 0.930
MFR Filler Speed 0.930 0.930
Alch Rel Balling Lvl 0.926 0.926
Balling Lvl Alch Rel 0.926 0.926
Hyd Pressure2 Hyd Pressure3 0.925 0.925
Hyd Pressure3 Hyd Pressure2 0.925 0.925
Balling Alch Rel 0.925 0.925
Alch Rel Balling 0.925 0.925
Density Alch Rel 0.903 0.903
Alch Rel Density 0.903 0.903
Brand CodeD Alch Rel 0.899 0.899
Alch Rel Brand CodeD 0.899 0.899

Linear View

We saw some weird distributions in our data, high variance in some of our summary statistics, and a lot of missing values in some variables. Now we are going to look at the data in a different way: as line plots. Our data is not a time series, but we think it is safe to assume that the observations were made over time, so we will plot each variable in observation order and see what patterns, if any, surface. To be clear, we are not suggesting that this should be interpreted as a time series; we are simply viewing the data through the lens of a line plot.

We will use pivot_longer() to reshape the data so that we can plot it one variable stacked on top of another. We will also use row_number() to create an index for the x-axis. We will use facet_wrap() to create a separate plot for each variable.

train_data_pp |>
  mutate(
    index = row_number()
  ) |>
  relocate(
    index,
    .before = everything()
  ) |>
  pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "values"
  ) |>
  ggplot(aes(x = index, y = values)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 1, scales = "free") +
  labs(
    title = "Line Plot of Variables",
    x = "Index",
    y = "Values"
  )

We see several interesting patterns.

  1. Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 have long runs of consecutive observations with the same value (see the run-length check after this list). These are very likely the values we were seeing in our histograms.
  2. Carb Flow’s pattern is consistent from 0 to ~2000, at which point it drops significantly.
  3. Carb Pressure1 exhibits what look like two different modes of operation: 0 to ~1350 and then ~1350 to the end. The “modes” are separated by what appears to be a small gap.
  4. Oxygen Filler also appears to have two modes of operation.
  5. Usage cont may be the oddest variable of them all. It exhibits a noisy pattern up until ~1600, at which point it jumps to a high level and stays there, still noisy.
  6. PH’s trend line is flat, but it clearly has a daily cycle. Just kidding (though it does almost look like time series data).
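
To back up the first observation, a minimal run-length sketch on Mnf Flow:

# rle() finds runs of identical consecutive values; a very long run at a
# single value supports the "effectively missing" suspicion.
runs <- rle(train_data$`Mnf Flow`[!is.na(train_data$`Mnf Flow`)])
max(runs$lengths)                     # length of the longest constant run
runs$values[which.max(runs$lengths)]  # the value held during that run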

PCA Study

We employed Principal Component Analysis (PCA) as a dimensionality reduction technique to better understand the structure of the predictor space and to address potential multicollinearity among variables. This will also give us our first glimpse at important predictors and related predictors, and we may recognize these patterns in our models, as well.

Given that the final modeling objective is to predict pH, PCA was applied strictly to the predictor variables (i.e., the input features), with the target variable excluded during the transformation phase.

Methodology

The PCA was conducted using the caret and tidyverse packages in R. The procedure included the following key steps:

  1. Standardization: Each feature was Yeo-Johnson transformed, centered, and scaled to unit variance (with median imputation for missing values) so that PCA was not biased toward features with larger scales.
# Exclude target
predictors <- train_data_pp |>
  select(-PH) |>
  select(where(is.numeric))

# Preprocess: transform and standardize
predictors_prep <- preProcess(predictors,
                              method = c("YeoJohnson", "center",
                                         "scale", "medianImpute"))

# Transform data using PCA
processed_data <- predict(predictors_prep, predictors)
  2. PCA Transformation: Principal components were extracted from the standardized predictor matrix.
# Perform PCA
pca_result <- prcomp(processed_data, center = FALSE, scale. = FALSE)
  3. Component Retention: The number of components to retain was informed by a combination of cumulative variance explained and visual inspection via a scree plot.
# Extract the variance explained from the PCA model
var_explained <- summary(pca_result)$importance[2,]

# Create the scree plot data frame. Need to factor PCs so they populate the
# plot correctly
scree_df <- data.frame(
  PC = factor(paste0("PC", 1:length(var_explained)), levels = paste0("PC", 1:length(var_explained))),
  Variance = var_explained
)

# Filter to simplify visualization
scree_df_small <- scree_df[1:15, ]

# Plot
ggplot(scree_df_small, aes(x = PC, y = Variance)) +
  geom_smooth(aes(group = 1), color = "darkblue", linewidth = 1, se = FALSE) +
  ggtitle("Scree Plot") +
  xlab("Principal Components") +
  ylab("Proportion of Variance Explained") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Results and Interpretation

We can use the graph above as a heuristic for determining the number of Principal Components to keep. The proportion of variance explained starts to taper off at PC6, so we would consider the first five PCs the most important for modeling; components beyond this contribute only marginally to the variance. Furthermore, we can examine the cumulative variance as each PC is added.

cumulative_variance <- summary(pca_result)$importance[3,]

scree_df$Cumulative_Variance <- cumulative_variance

# Print table
scree_df |>
  as_tibble() |>
  filter(row_number() < 16) |>
  kable()
PC Variance Cumulative_Variance
PC1 0.20922 0.20922
PC2 0.16138 0.37060
PC3 0.07006 0.44066
PC4 0.05272 0.49337
PC5 0.04738 0.54075
PC6 0.04389 0.58464
PC7 0.04048 0.62512
PC8 0.03662 0.66174
PC9 0.03474 0.69648
PC10 0.03118 0.72766
PC11 0.02893 0.75659
PC12 0.02553 0.78212
PC13 0.02519 0.80731
PC14 0.02249 0.82981
PC15 0.02216 0.85196

From this output, we observe that the first five components capture roughly 54% of the total variance, and that it takes 15 components to reach about 85%.

Loadings and Interpretability

The rotation matrix provides the loadings of each original variable on the principal components. Loadings close to ±1 indicate strong influence, while values near 0 indicate minimal contribution.

pca_result$rotation[,1:5] |>
  as.data.frame() |>
  arrange(desc(abs(PC1)))
##                              PC1          PC2          PC3          PC4
## `Alch Rel`           0.351917110  0.018637006  0.015947645 -0.021509091
## Balling              0.350242357 -0.038655554  0.010768884 -0.051510765
## `Balling Lvl`        0.347587056  0.018036174  0.058019203 -0.060589501
## Density              0.342556769  0.003647184 -0.002780885 -0.008220558
## `Carb Rel`           0.327287394  0.040790855  0.033190393 -0.055873034
## `Carb Volume`        0.312671225 -0.010849528  0.077849440  0.014418396
## `Brand Code`D        0.309509885  0.024613794 -0.030268525  0.036554770
## `Brand Code`B       -0.281886397 -0.018590616 -0.081590948  0.205154803
## `Hyd Pressure4`     -0.216438550 -0.010538108  0.259170039 -0.190256519
## `Carb Pressure`      0.185499136  0.002969658  0.036219506  0.321843460
## `Brand Code`A        0.102964974 -0.011611054  0.125500991 -0.121825258
## `Pressure Setpoint` -0.097974310 -0.253715282  0.097041709  0.017762720
## Temperature         -0.085593182  0.073031414  0.238809759 -0.191529526
## `Brand Code`C       -0.080867400  0.007252593  0.040045941 -0.239356190
## `PC Volume`         -0.070027461  0.044363458 -0.300849404 -0.245631858
## `Fill Pressure`     -0.055001282 -0.243216724 -0.006425723  0.067604723
## `Air Pressurer`     -0.041131334  0.025209215  0.098674448  0.289146621
## `Hyd Pressure2`      0.034053272 -0.356389157 -0.234508889 -0.128558268
## `Hyd Pressure3`      0.033357301 -0.371170916 -0.209235867 -0.097374419
## `Oxygen Filler`     -0.033055156  0.197534661 -0.049257586 -0.014325367
## `Pressure Vacuum`   -0.032736654  0.268309580 -0.008011089  0.199000520
## `Fill Ounces`       -0.031395082  0.003712749  0.119973827  0.148610705
## PSC                 -0.027814800 -0.009774838 -0.058858968 -0.040354692
## `Mnf Flow`           0.024485822 -0.370246452  0.074307585 -0.015327357
## `PSC CO2`           -0.024235975 -0.017887178  0.046661968  0.026701097
## `Carb Temp`          0.023503641  0.009142603 -0.001680053  0.344901533
## `Filler Speed`       0.023311543 -0.069554166 -0.401263373  0.258147919
## `Usage cont`         0.020636458 -0.235304698  0.168029336  0.032977648
## `Bowl Setpoint`      0.019544205  0.293735628 -0.180089269 -0.265432969
## MFR                  0.019144953  0.014987724 -0.105029957  0.157232038
## `Hyd Pressure1`      0.015014704 -0.293051774 -0.294238611 -0.220849777
## `Carb Flow`         -0.012178379  0.078728206 -0.353299056  0.232580515
## `Filler Level`       0.009614800  0.297688391 -0.140515049 -0.270964806
## `PSC Fill`          -0.008729149  0.020514009  0.031714068 -0.013912536
## `Carb Pressure1`     0.006908869 -0.148667635  0.370498100  0.012035389
##                               PC5
## `Alch Rel`          -0.0333321552
## Balling              0.0012498277
## `Balling Lvl`        0.0180168311
## Density              0.0169572074
## `Carb Rel`           0.0042927136
## `Carb Volume`       -0.1089099266
## `Brand Code`D       -0.0990504254
## `Brand Code`B       -0.1349954459
## `Hyd Pressure4`      0.1670885478
## `Carb Pressure`      0.4985690147
## `Brand Code`A        0.1690738919
## `Pressure Setpoint`  0.0888629693
## Temperature          0.2261451810
## `Brand Code`C        0.1686322704
## `PC Volume`          0.2243499384
## `Fill Pressure`      0.0701692476
## `Air Pressurer`     -0.0159749133
## `Hyd Pressure2`      0.0620873110
## `Hyd Pressure3`      0.0648982453
## `Oxygen Filler`      0.0873573939
## `Pressure Vacuum`   -0.0831161172
## `Fill Ounces`       -0.0698995664
## PSC                 -0.0053777634
## `Mnf Flow`          -0.0175941414
## `PSC CO2`            0.0701332070
## `Carb Temp`          0.6153834505
## `Filler Speed`      -0.1399971362
## `Usage cont`        -0.1300730605
## `Bowl Setpoint`      0.0579414336
## MFR                 -0.1818287529
## `Hyd Pressure1`      0.1129535985
## `Carb Flow`          0.0651455166
## `Filler Level`       0.0526725738
## `PSC Fill`           0.0004411254
## `Carb Pressure1`    -0.0783446670

By examining the loading structure:

We can interpret PC1 as a linear combination emphasizing variables such as Alch Rel, Balling, Balling Lvl, and Density.

Components with clear thematic groupings (e.g., all chemistry variables, or all environmental sensors) enhance interpretability and suggest latent structures in the data.

Preprocessing

We have discovered much and have some work to do. Some preprocessing will be applied to our training and testing datasets and used for all regression models; other preprocessing, such as centering, scaling, and PCA, will be applied more selectively per model.

First, we have already dealt with our categorical data in train_data_pp and test_data_pp. We shall use them as a starting place for further preprocessing.

We will deal with our missing PH data. We don’t want to train our models to predict NA. So we are going to drop the four rows missing PH values.

train_data_pp <- train_data_pp |>
  filter(!is.na(`PH`))
# test_data_pp is already void of PH values

And now we will remove highly correlated variables. We will use a correlation threshold of 0.95. While 0.75 is a common starting point, retaining more features helps maintain signal for models like PLS, which can internally handle correlated predictors by extracting latent factors.

high_corr <- train_data_pp |>
  select(-PH) |>
  cor(use = "pairwise.complete.obs") |>
  findCorrelation(cutoff = 0.95, names = TRUE)
# print the names of the highly correlated variables
glue("Removing columns: {high_corr}")
## Removing columns: Balling
# remove the highly correlated variables by name; the correlation matrix
# excludes PH, so its column indices would not align with the full data frame
train_data_pp <- select(train_data_pp, -all_of(high_corr))
test_data_pp <- select(test_data_pp, -all_of(high_corr))

For missing data we will impute, but the suggested imputation method varies by model, so we are going to leave that processing to be done at training time.
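
For reference, a minimal sketch of the two imputation flavors we will lean on later (note that caret adds centering and scaling whenever knnImpute is requested, since nearest-neighbor distances need comparable scales):

# knnImpute fills each missing cell from its k nearest complete rows;
# medianImpute simply fills each column with its median.
pp_knn <- preProcess(select(train_data_pp, -PH), method = "knnImpute")
pp_med <- preProcess(select(train_data_pp, -PH), method = "medianImpute")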

Our line plots revealed some data that is suspiciously constant over many consecutive observations: Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3. We suspect that these runs are effectively missing data. We have three options:

  1. Leave them as they are.
  2. Remove the variables altogether.
  3. Set the suspicious observations to NA and impute them.

We are going with option 3: we will recode the suspicious values as NA and let them be imputed at training time. This rests on the assumption that the constant runs are encoded missing data; as the code comments note, it is imperfect for any observations where 0 (or a negative flow) is a legitimate reading.

train_data_pp <- train_data_pp |>
  mutate(
    # convert the NAs encoded as numbers into proper NAs.
    `Mnf Flow` = if_else(`Mnf Flow` < 0, NA, `Mnf Flow`),
    # For those with legitimate 0 values, this is flawed.
    `Hyd Pressure1` = if_else(`Hyd Pressure1` == 0, NA, `Hyd Pressure1`),
    `Hyd Pressure2` = if_else(`Hyd Pressure2` == 0, NA, `Hyd Pressure2`),
    `Hyd Pressure3` = if_else(`Hyd Pressure3` == 0, NA, `Hyd Pressure3`)
  )

Regression Workshop

This is our little workshop of utilities. You will see below in preprocess() that we don’t actually partition our data. Because our dataset is somewhat small, we decided to use cross-validation to test the performance of our models; it trains and tests on many different combinations of the data. Partitioning our training dataset would compromise us in one of two mutually exclusive ways: partitioning without cross-validation trains the model on only one snapshot of our data, which could introduce bias, while partitioning and cross-validating would train it on an even smaller dataset. We feel we have everything to gain by not partitioning the data and nothing to lose.

preprocess() is a simple utility that sets the random seed so that all models use the same cross-validation sequence of data. It extracts y from data, applies the preprocessing methods to the predictors, and returns a list containing the preprocessed predictors (X), the preProcess model (PPM), and the response variable (y).

preprocess <- function(data, methods) {
  set.seed(31415)

  # separate the data into predictors and targets
  X <- dplyr::select(data, -PH)
  y <- data$PH

  # preprocess the data
  model <- preProcess(X, method = methods)
  X <- predict(model, X)
  list(
    PPM = model,
    X = X,
    y = y
  )
}

report() reports accuracy with MAE, RMSE and R2. It also reports the most important variables in descending order.

report <- function(model) {
  # Note: these means are taken across every tuning candidate in
  # model$results, not just the best tune.
  mae <- mean(model$results$MAE)
  rmse <- mean(model$results$RMSE)
  r2 <- mean(model$results$Rsquared)
  glue("{model$method}: MAE={round(mae, 2)}, RMSE={round(rmse, 2)}, R2={round(r2, 2)}") |>
    print()
  glue("Best tuned parameters: {colnames(model$bestTune)} = {model$bestTune}") |>
    print()
  print(varImp(model))
}

ctrl is our directive to the training process to use cross-validation to train and evaluate our models. Most models will use this control object. Some models will use a less intensive cross-validation setup: for example, Random Forest takes a long time to run with our default parameters, so we cut its cross-validation down to 2 folds and 5 repeats.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

Linear Regression

Refined Preprocessing

data <- preprocess(
  train_data_pp,
  methods = c("knnImpute", "YeoJohnson", "center", "scale")
)

Train Partial Least Squares (PLS) Model

pls_model <- train(
  x = data$X,
  y = data$y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
report(pls_model)
## pls: MAE=0.11, RMSE=0.14, R2=0.38
## Best tuned parameters: ncomp = 20
## pls variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                           Overall
## `\\`Mnf Flow\\``           100.00
## `\\`Brand Code\\`C`         80.48
## `\\`Usage cont\\``          64.55
## `\\`Bowl Setpoint\\``       64.05
## `\\`Filler Level\\``        54.85
## `\\`Hyd Pressure3\\``       52.32
## Temperature                 49.47
## `\\`Pressure Setpoint\\``   49.16
## `\\`Hyd Pressure2\\``       46.01
## `\\`Brand Code\\`B`         43.27
## `\\`Fill Pressure\\``       40.36
## `\\`Hyd Pressure1\\``       36.08
## `\\`Pressure Vacuum\\``     35.67
## `\\`Carb Pressure1\\``      33.63
## `\\`Carb Flow\\``           33.53
## `\\`Brand Code\\`D`         30.95
## `\\`Brand Code\\`A`         29.53
## `\\`Oxygen Filler\\``       29.38
## `\\`Hyd Pressure4\\``       23.45
## `\\`Carb Rel\\``            23.19

Train Ordinary Least Squares (OLS) Model

Ordinary Least Squares (OLS) is a benchmark linear modeling method. We’ve already preprocessed the data in a way that suits both PLS and OLS. To determine which of the two is the stronger linear model, let’s fit an OLS model.

ols_model <- train(
  x = data$X,
  y = data$y,
  method = "lm",
  trControl = ctrl
)
report(ols_model)
## lm: MAE=0.1, RMSE=0.13, R2=0.4
## Best tuned parameters: intercept = TRUE
## lm variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##                           Overall
## `\\`Mnf Flow\\``          100.000
## `\\`Carb Pressure1\\``     61.468
## Temperature                38.469
## `\\`Usage cont\\``         37.312
## `\\`Bowl Setpoint\\``      34.030
## `\\`Brand Code\\`C`        32.573
## `\\`Hyd Pressure3\\``      31.683
## Density                    29.763
## `\\`Pressure Setpoint\\``  26.424
## `\\`Oxygen Filler\\``      26.261
## `\\`Brand Code\\`A`        18.418
## `\\`Alch Rel\\``           16.912
## `\\`Carb Flow\\``          13.997
## `\\`Filler Level\\``       12.102
## `\\`Fill Ounces\\``        11.678
## `\\`PSC CO2\\``            11.650
## `\\`Hyd Pressure2\\``      11.131
## `\\`Carb Rel\\``           10.193
## `\\`PC Volume\\``           9.572
## `\\`PSC Fill\\``            8.811

OLS is interpretable and useful as a baseline. It assumes linear relationships and independence. It may not perform as well as more complex models if predictors are collinear or relationships are nonlinear.

Train Lasso Regression Model

To assess whether regularization improves performance, we will also train Lasso, Ridge, and Elastic Net models. These approaches can reduce overfitting and handle correlated predictors more effectively than OLS. We will compare them against OLS and PLS using the same repeated cross-validation framework.

lasso_model <- train(
  x = data$X,
  y = data$y,
  method = "lasso",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(lasso_model)
## lasso: MAE=0.11, RMSE=0.14, R2=0.37
## Best tuned parameters: fraction = 0.9
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                     Overall
## `Mnf Flow`          100.000
## `Bowl Setpoint`      59.594
## `Filler Level`       51.518
## `Usage cont`         49.516
## `Pressure Setpoint`  46.216
## `Carb Flow`          41.604
## `Brand Code`C        39.243
## `Pressure Vacuum`    26.382
## `Hyd Pressure3`      26.289
## `Hyd Pressure2`      22.913
## `Fill Pressure`      20.184
## `Brand Code`D        16.949
## `Oxygen Filler`      14.012
## `Carb Rel`           12.635
## Temperature          11.965
## `Hyd Pressure1`      11.618
## `Alch Rel`           10.601
## `Hyd Pressure4`      10.396
## `Brand Code`A         4.905
## MFR                   4.545

Train Ridge Regression Model

ridge_model <- train(
  x = data$X,
  y = data$y,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(ridge_model)
## ridge: MAE=0.1, RMSE=0.13, R2=0.4
## Best tuned parameters: lambda = 0.01
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                     Overall
## `Mnf Flow`          100.000
## `Bowl Setpoint`      59.594
## `Filler Level`       51.518
## `Usage cont`         49.516
## `Pressure Setpoint`  46.216
## `Carb Flow`          41.604
## `Brand Code`C        39.243
## `Pressure Vacuum`    26.382
## `Hyd Pressure3`      26.289
## `Hyd Pressure2`      22.913
## `Fill Pressure`      20.184
## `Brand Code`D        16.949
## `Oxygen Filler`      14.012
## `Carb Rel`           12.635
## Temperature          11.965
## `Hyd Pressure1`      11.618
## `Alch Rel`           10.601
## `Hyd Pressure4`      10.396
## `Brand Code`A         4.905
## MFR                   4.545

Train Elastic Net Model

elastic_model <- train(
  x = data$X,
  y = data$y,
  method = "glmnet",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
report(elastic_model)
## glmnet: MAE=0.11, RMSE=0.14, R2=NaN
## Best tuned parameters: alpha = 0.952631578947368
## Best tuned parameters: lambda = 0.000335359611608051
## glmnet variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                     Overall
## `Mnf Flow`          100.000
## `Hyd Pressure3`      62.568
## `Brand Code`C        50.353
## `Bowl Setpoint`      49.733
## `Carb Pressure1`     37.185
## Density              36.771
## `Alch Rel`           32.715
## `Usage cont`         24.122
## Temperature          21.838
## `Pressure Setpoint`  19.312
## `Hyd Pressure2`      16.036
## `Oxygen Filler`      15.373
## `Brand Code`A        14.679
## `Filler Level`       13.963
## `Carb Rel`           10.869
## `Carb Flow`          10.216
## `Carb Volume`         9.254
## `Fill Pressure`       7.366
## `Fill Ounces`         6.815
## `Hyd Pressure1`       6.218

We see a warning while training the Elastic Net model that there were missing values in the resampled performance measures, which is why the mean R-squared above is NaN. Since the model finished training and produced otherwise valid results, and since, as we will see below, its metrics are close to the others’, we will not recommend selecting it as our final model unless it offers clear advantages (which, as we will see, it does not).

Comparison of Linear Models

After training and comparing five linear modeling approaches (OLS, PLS, Lasso, Ridge, and Elastic Net), we find that performance across all methods is remarkably similar. Each model was evaluated using repeated 10-fold cross-validation, and results were compared using RMSE (root mean squared error), R-squared, and MAE (mean absolute error).

The best RMSE and R-squared values were achieved by OLS and Ridge, and the differences among all five models are small (roughly 0.005 RMSE units separate best from worst). The lowest MAE was from OLS at 0.104241, but again, the margin is extremely small.

Given the similarity in performance, we recommend Ordinary Least Squares (OLS) for this use case:

  • It is simple, interpretable, and fast to compute.
  • It slightly outperformed PLS and Lasso in both MAE and RMSE.
  • Its coefficients can be used directly for business insight and explainability.

Elastic Net showed nearly identical performance but produced a warning about missing resampled performance measures, suggesting mild instability. If regularization becomes important due to changing data conditions or overfitting concerns in future phases, Lasso or Elastic Net could be revisited.

This modeling exercise confirms that pH can be predicted with consistent accuracy using standard linear models. The choice of model does not significantly alter accuracy, allowing the business to prioritize simplicity and transparency. Implementing the OLS model enables stakeholders to:

  • Understand which production variables most influence pH
  • Monitor predictions in real time using a straightforward model
  • Translate coefficients into actionable guidance for technicians (see the sketch below)
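
As a sketch of that last point, the standardized coefficients of the trained ols_model can be ranked by magnitude (they compare relative influence rather than raw units, since the predictors were centered and scaled):

# Largest-magnitude standardized coefficients indicate the most influential
# production variables for pH in the OLS fit.
b <- coef(ols_model$finalModel)
b <- b[names(b) != "(Intercept)"]  # drop the intercept before ranking
head(b[order(abs(b), decreasing = TRUE)], 10)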

Nonlinear Regression

Refined Preprocessing

data <- preprocess(train_data_pp, c("center", "scale", "medianImpute", "pca"))

KNN Model

k-Nearest Neighbors is a non-parametric method used for classification and regression. It works by finding the k nearest neighbors of a given point and using their values to predict the value at that point; for regression, the prediction is the average of the neighbors’ response values.
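
To make the mechanics concrete, a minimal by-hand sketch on hypothetical one-dimensional data (not project data):

# Predict at x0 by averaging the y-values of the k nearest training points.
x <- c(1, 2, 3, 10, 11)
y <- c(8.3, 8.4, 8.5, 8.9, 9.0)
x0 <- 2.5; k <- 3
nearest <- order(abs(x - x0))[1:k]  # indices of the k closest points
mean(y[nearest])                    # KNN regression estimate at x0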

knn_model <- train(
  x = data$X,
  y = data$y,
  method = "knn",
  tuneLength = 20,
  trControl = ctrl
)
report(knn_model)
## knn: MAE=0.1, RMSE=0.13, R2=0.45
## Best tuned parameters: k = 9
## loess r-squared variable importance
## 
##       Overall
## PC2  100.0000
## PC1   32.9145
## PC6   30.1517
## PC7   27.3098
## PC17  16.6429
## PC4   13.9365
## PC8   13.1987
## PC20  11.9676
## PC3    9.8122
## PC15   8.7371
## PC5    8.3752
## PC18   7.8995
## PC9    7.2142
## PC16   7.1215
## PC19   6.2098
## PC12   3.1794
## PC13   2.5477
## PC10   1.5031
## PC11   0.5908
## PC14   0.0000

SVM Model

Support Vector Machines (SVM) is a supervised learning algorithm that can be used for classification or regression. For classification it finds the hyperplane that best separates the classes; for regression (SVR) it fits a function that keeps most points within an epsilon-insensitive margin. We will try three SVM kernels: linear, polynomial, and radial.
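
A minimal sketch of epsilon-regression on toy data, using the e1071 package loaded in Setup (data and parameters are illustrative):

# eps-regression fits a function that keeps most points inside an epsilon
# tube; points on or outside the tube become the support vectors.
set.seed(1)
toy <- data.frame(x = seq(0, 10, length.out = 50))
toy$y <- sin(toy$x) + rnorm(50, sd = 0.1)
fit <- svm(y ~ x, data = toy, type = "eps-regression",
           kernel = "radial", epsilon = 0.1)
head(predict(fit, toy))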

Linear SVM Model

svml_model <- train(
  x = data$X,
  y = data$y,
  method = "svmLinear",
  tuneLength = 10,
  trControl = ctrl
)
report(svml_model)
## svmLinear: MAE=0.11, RMSE=0.14, R2=0.34
## Best tuned parameters: C = 1
## loess r-squared variable importance
## 
##       Overall
## PC2  100.0000
## PC1   32.9145
## PC6   30.1517
## PC7   27.3098
## PC17  16.6429
## PC4   13.9365
## PC8   13.1987
## PC20  11.9676
## PC3    9.8122
## PC15   8.7371
## PC5    8.3752
## PC18   7.8995
## PC9    7.2142
## PC16   7.1215
## PC19   6.2098
## PC12   3.1794
## PC13   2.5477
## PC10   1.5031
## PC11   0.5908
## PC14   0.0000

Polynomial SVM Model

svmp_model <- train(
  x = data$X,
  y = data$y,
  method = "svmPoly",
  tuneLength = 3,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(svmp_model)
## svmPoly: MAE=0.11, RMSE=0.15, R2=0.34
## Best tuned parameters: degree = 3
## Best tuned parameters: scale = 0.01
## Best tuned parameters: C = 1
## loess r-squared variable importance
## 
##       Overall
## PC2  100.0000
## PC1   32.9145
## PC6   30.1517
## PC7   27.3098
## PC17  16.6429
## PC4   13.9365
## PC8   13.1987
## PC20  11.9676
## PC3    9.8122
## PC15   8.7371
## PC5    8.3752
## PC18   7.8995
## PC9    7.2142
## PC16   7.1215
## PC19   6.2098
## PC12   3.1794
## PC13   2.5477
## PC10   1.5031
## PC11   0.5908
## PC14   0.0000

Radial SVM Model

svmr_model <- train(
  x = data$X,
  y = data$y,
  method = "svmRadial",
  tuneLength = 5,
  trControl = ctrl
)
report(svmr_model)
## svmRadial: MAE=0.09, RMSE=0.12, R2=0.49
## Best tuned parameters: sigma = 0.0322972739788172
## Best tuned parameters: C = 4
## loess r-squared variable importance
## 
##       Overall
## PC2  100.0000
## PC1   32.9145
## PC6   30.1517
## PC7   27.3098
## PC17  16.6429
## PC4   13.9365
## PC8   13.1987
## PC20  11.9676
## PC3    9.8122
## PC15   8.7371
## PC5    8.3752
## PC18   7.8995
## PC9    7.2142
## PC16   7.1215
## PC19   6.2098
## PC12   3.1794
## PC13   2.5477
## PC10   1.5031
## PC11   0.5908
## PC14   0.0000

MARS Model

Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that can capture both linear and non-linear relationships. It works by fitting piecewise linear basis functions (hinges) to the data.
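
The building block is the hinge function; a minimal illustration:

# MARS bases are hinge functions h(x) = max(0, x - c) and max(0, c - x);
# the fitted model is a weighted sum of such terms (and their products
# when degree > 1).
hinge <- function(x, c) pmax(0, x - c)
hinge(seq(-2, 2, by = 1), 0.5)  # returns 0.0 0.0 0.0 0.5 1.5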

We are using a less intensive cross validation method for MARS because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.

grid <- expand.grid(
  degree = 1:2,
  nprune = 5:10
)
mars_model <- train(
  x = data$X,
  y = data$y,
  method = "earth",
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(mars_model)
## earth: MAE=0.11, RMSE=0.14, R2=0.3
## Best tuned parameters: nprune = 10
## Best tuned parameters: degree = 2
## earth variable importance
## 
##      Overall
## PC2  100.000
## PC1   71.570
## PC6   71.570
## PC7   52.247
## PC17  28.489
## PC18  24.062
## PC13  16.917
## PC11   9.624
## PC15   9.624
## PC16   0.000

Neural Network Model

Neural Networks are a class of models that are inspired by the way the human brain works. They are used for both classification and regression tasks. In this case, we are using it for regression.

As with MARS, we are using a less intensive cross validation method for Neural Networks because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.

grid <- expand.grid(
  size = c(2, 4, 6, 8, 10),      # hidden units in the single hidden layer
  decay = c(0, 0.05, 0.1, 0.15)  # weight decay (L2 regularization)
)
nn_model <- train(
  x = data$X,
  y = data$y,
  method = "nnet",
  linout = TRUE,
  trace = FALSE,
  maxit = 1000,
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(nn_model)
## nnet: MAE=0.11, RMSE=0.17, R2=0.38
## Best tuned parameters: size = 6
## Best tuned parameters: decay = 0.1
## nnet variable importance
## 
##      Overall
## PC20  100.00
## PC6    97.17
## PC2    93.23
## PC19   82.66
## PC7    82.64
## PC16   80.55
## PC17   68.43
## PC4    68.32
## PC1    58.24
## PC18   54.43
## PC10   44.00
## PC13   34.40
## PC15   33.81
## PC8    32.10
## PC5    30.84
## PC3    28.66
## PC11   28.27
## PC14   18.44
## PC12   10.77
## PC9     0.00

Nonlinear Summary

Of the non-linear models, the radial SVM performed best with an R-squared of 0.485, followed closely by KNN at 0.455. Somewhat surprisingly, the MARS model had the lowest R-squared of the non-linear bunch at 0.302, while the Neural Network came in at 0.384.

MAE and RMSE largely reinforce R-squared’s story. The non-linear champion, the radial SVM, had the lowest MAE (0.093) and the lowest RMSE (0.125), while at the other extreme MARS had the highest MAE (0.114) and the third-highest RMSE (0.144).

Decision Trees

Refined Preprocessing

rpart_data <- preprocess(train_data_pp, "medianImpute")
# column names with backticks are causing issues with the caret package.
rpart_data$X <- clean_names(rpart_data$X)

Recursive Partitioning Model

rpart_model <- train(
  x = rpart_data$X,
  y = rpart_data$y,
  method = "rpart",
  tuneLength = 10,
  trControl = ctrl
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
report(rpart_model)
## rpart: MAE=0.11, RMSE=0.14, R2=0.34
## Best tuned parameters: cp = 0.0136301352354267
## rpart variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                   Overall
## mnf_flow          100.000
## temperature        88.758
## usage_cont         80.417
## filler_speed       76.020
## balling_lvl        69.764
## fill_pressure      51.009
## bowl_setpoint      48.788
## hyd_pressure3      48.601
## filler_level       45.544
## air_pressurer      45.273
## pressure_vacuum    41.667
## brand_code_c       40.155
## carb_flow          36.453
## pressure_setpoint  32.795
## carb_pressure1     32.761
## carb_rel           29.118
## oxygen_filler      28.459
## alch_rel           20.989
## brand_code_d       18.941
## carb_volume         7.517

Visualize the model

rpart.plot(
  rpart_model$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Decision Tree for pH Prediction"
)

The recursive partitioning model was built using the rpart method, tuning over 10 values of the complexity parameter (cp). The optimal cp value found was 0.0136, balancing tree complexity and performance. While the decision tree offers good interpretability, its predictive power was moderate: an R-squared of 0.34 indicates only limited ability to explain the variance in the response variable. The visualization of the decision tree helps us understand how key variables influence the pH prediction.

The top three predictors identified were mnf_flow (100), temperature (88.76), and usage_cont (80.42).
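
To see the candidates behind that cp choice, the final tree’s complexity table can be inspected (cptable is a standard field of an rpart object):

# Each row gives a cp threshold, the number of splits, and the relative
# error of the tree pruned at that threshold.
head(rpart_model$finalModel$cptable)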

Random Forest Model

rf_data <- rpart_data

# Random forest takes a very long time to run with our default parameters.
# We are cutting down on cross validation
rf_model <- train(
  x = rf_data$X,
  y = rf_data$y,
  method = "rf",   # rf is caret's default when no method is given; stated explicitly
  importance = TRUE,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(rf_model)
## rf: MAE=0.08, RMSE=0.11, R2=0.63
## Best tuned parameters: mtry = 18
## rf variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                 Overall
## mnf_flow         100.00
## brand_code_c      92.93
## pressure_vacuum   88.75
## oxygen_filler     77.26
## balling_lvl       65.47
## air_pressurer     61.98
## temperature       59.73
## alch_rel          58.12
## carb_rel          58.11
## usage_cont        54.06
## density           52.87
## hyd_pressure3     51.18
## filler_speed      48.47
## carb_flow         45.49
## carb_pressure1    44.49
## hyd_pressure1     43.02
## bowl_setpoint     40.53
## hyd_pressure2     38.17
## fill_pressure     31.83
## filler_level      30.07

Visualize the model

varImpPlot(
  rf_model$finalModel,
  type = 1,
  main = "Random Forest Variable Importance"
)

The Random Forest outperformed the decision tree, nearly doubling the explanatory power (R-squared of 0.63). It provided much stronger predictive performance, suggesting it is better suited for this dataset despite longer training time and reduced interpretability.

Top predictors included: mnf_flow (100% importance), consistent with the decision tree.

Model Comparison

Having used cross-validation to build our models, each is now populated with resampling results. Here we shall take a look at some of those summary statistics and compare one model to another.

model_list <- list(
  PLS = pls_model,
  OLS = ols_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model,
  KNN = knn_model,
  SVM_linear = svml_model,
  SVM_poly = svmp_model,
  SVM_radial = svmr_model,
  MARS = mars_model,
  NeuralNet = nn_model,
  RPart = rpart_model,
  RandomForest = rf_model
)

# extract the results from each model
model_results <- tibble(
  Model = names(model_list),
  MAE = map(
    model_list, function(x) mean(x$results$MAE, na.rm = TRUE)
  ) |> as_vector(),
  RMSE = map(
    model_list, function(x) mean(x$results$RMSE, na.rm = TRUE)
  ) |> as_vector(),
  Rsquared = map(
    model_list, function(x) mean(x$results$Rsquared, na.rm = TRUE)
  ) |> as_vector()
)
model_results |>
  arrange(desc(Rsquared), MAE, RMSE) |>
  mutate(
    MAE = round(MAE, 3),
    RMSE = round(RMSE, 3),
    Rsquared = round(Rsquared, 3)
  ) |>
  kable()
Model MAE RMSE Rsquared
RandomForest 0.080 0.108 0.626
SVM_radial 0.093 0.125 0.485
KNN 0.099 0.128 0.455
OLS 0.104 0.134 0.396
Ridge 0.104 0.134 0.396
NeuralNet 0.109 0.171 0.384
PLS 0.106 0.136 0.380
Lasso 0.108 0.138 0.371
ElasticNet 0.109 0.139 0.370
RPart 0.111 0.141 0.340
SVM_poly 0.113 0.148 0.338
SVM_linear 0.110 0.141 0.337
MARS 0.114 0.144 0.302

The Champion

A champion, ideally, is a model with the lowest MAE, the lowest RMSE, and the highest R2. And we have such a champion. Please allow us to present the champion of our modeling competition: Random Forest. It is moderately interpretable, though not very fast to compute.

Among all tested models, the Random Forest model demonstrated the best overall performance, achieving the lowest MAE (0.080), lowest RMSE (0.108), and highest R-squared (0.626), indicating superior predictive accuracy and explanatory power.

The SVM radial and KNN models followed, with moderate performance (R-squared of 0.485 and 0.455, respectively), but did not match the precision of Random Forest. Linear models such as OLS, Ridge, and Elastic Net showed similar results, with R-squared values clustered around 0.37–0.40.

The Neural Network, PLS, Lasso, and RPart (decision tree) models exhibited slightly weaker performance, with R-squared values below 0.40. The MARS model had the lowest performance overall (R-squared of 0.302).

In conclusion, Random Forest is the recommended model for this dataset due to its balance of accuracy and robustness, despite being more complex and computationally intensive.

Final Predictions

Finally, we will make predictions using our champion model, Random Forest. We will use the test data to make predictions and save them to a CSV file.

# preprocess the data with the same model with which we preprocessed the training model
df <- predict(rf_data$PPM, test_data_pp)
# the variable names in our training set caused problems with `randomForest` so we
# "cleaned" them. The model is expecting the same variables.
df <- clean_names(df)
data.frame(
  SampleID = 1:nrow(df),
  Predicted_pH = predict(rf_model, newdata = df)
) |>
  write.csv(
    "Final_Predictions.csv",
    row.names = FALSE
  )