library(corrplot)
library(caret)
library(earth)
library(e1071)
library(Formula)
library(GGally)
library(ggpubr)
library(glmnet)
library(glue)
library(janitor)
library(knitr)
library(Metrics)
library(plotmo)
library(plotrix)
library(pls)
library(RANN)
library(randomForest)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(tidyverse)
Project 2: Team Assignment
Assignment
ABC Beverage is conducting an internal review of its manufacturing process with a focus on understanding the factors that influence product pH levels. This technical report highlights our team’s data science-driven approach to modeling and predicting pH using historical production data.
To achieve this, we analyzed the dataset provided by ABC Beverage, performed data cleaning and exploratory analysis, and evaluated several classes of models, including linear and nonlinear regressions, decision trees, and random forests. Throughout this report we outline our methodology, modeling process, evaluation strategy, and the final model selected for predicting pH levels.
Setup
Load the Data
First we must load the data. We were provided two datasets and initially presumed that one was designated for training (“StudentData.xlsx”) and the other for testing (“StudentEvaluation.xlsx”). However, we found that “StudentEvaluation.xlsx” does not contain target variable values, so we concluded that “StudentData.xlsx” is for both training and testing, and that “StudentEvaluation.xlsx” is for our final predictions.
<- read_excel("StudentData.xlsx")
train_data <- read_excel("StudentEvaluation.xlsx") test_data
EDA
George Hagstrom, our professor of DATA 607, defined exploratory data analysis as “…the art of looking at data in a systematic way in order to understand the underlying structure of the data. It has two main goals: ensure data quality and uncover patterns to guide future analysis. EDA is detective work: you ask and answer questions, which inspires more questions.” And that is exactly what we will do. We are going to get to know our datasets by summarizing them, scrutinizing them, looking at their correlation properties, and visualizing them.
Summary Information
We will take a look at a sample of the data: its columns and types, the number of rows and columns, and the first few rows. Finally, we will calculate summary statistics (the mean, median, min, max, and standard deviation of each column) so that we may better understand the nature of our data.
head(train_data)
# A tibble: 6 × 33
`Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
<chr> <dbl> <dbl> <dbl> <dbl>
1 B 5.34 24.0 0.263 68.2
2 A 5.43 24.0 0.239 68.4
3 B 5.29 24.1 0.263 70.8
4 A 5.44 24.0 0.293 63
5 A 5.49 24.3 0.111 67.2
6 A 5.38 23.9 0.269 66.6
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
# `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
# `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
# `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
# `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
# `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
# `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(train_data)
tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
$ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
$ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
$ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
$ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
$ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
$ Carb Temp : num [1:2571] 141 140 145 133 137 ...
$ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
$ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
$ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
$ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
$ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
$ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
$ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
$ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
$ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
$ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
$ Filler Level : num [1:2571] 121 119 120 118 119 ...
$ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
$ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
$ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
$ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
$ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
$ MFR : num [1:2571] 725 727 735 731 723 ...
$ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
$ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
$ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
$ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
$ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
$ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
$ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
$ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
$ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
$ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
We can see there are 33 columns and 2,571 rows in total. The 33 columns include the “PH” column, which is what we will be aiming to predict: PH will be the response variable throughout our modeling exploration.
We can see that we have mostly numeric values. The only character (categorical) variable is Brand Code. Our team’s preferred guidance on predictive modeling, Applied Predictive Modeling by Max Kuhn and Kjell Johnson, recommends either converting categorical predictors to dummy variables or removing them if they are uninformative or have too many categories. We have found that Brand Code has significant importance in some of our models, so we will convert it to dummy variables before training (models such as Lasso require numeric predictors).
We can also see that there are negative values, which means we won’t be able to apply the BoxCox method when preprocessing our data. Our guidance and standards, from Applied Predictive Modeling, state that if the data includes non-positive values we should use the YeoJohnson method instead.
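As a quick illustration (a sketch, not part of our pipeline), we can confirm which columns contain non-positive values, which rule out Box-Cox, and verify that preProcess() accepts Yeo-Johnson in their place:

# Sketch: flag numeric columns containing zero or negative values (these rule out Box-Cox).
non_positive <- sapply(select(train_data, where(is.numeric)), function(x) {
  any(x <= 0, na.rm = TRUE)
})
names(non_positive)[non_positive]

# Yeo-Johnson handles zero and negative values where Box-Cox cannot.
pp_yj <- preProcess(select(train_data, where(is.numeric)), method = "YeoJohnson")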
We will also take a look at the summary statistics of our data. We would normally use the summary() function, but it prints wide rather than long and does not present well. So, we are going to take a stick shift approach.
df <- select(train_data, -`Brand Code`)
data.frame(
  mean = sapply(df, mean, na.rm = TRUE) |> round(3),
  median = sapply(df, median, na.rm = TRUE) |> round(3),
  min = sapply(df, min, na.rm = TRUE) |> round(3),
  max = sapply(df, max, na.rm = TRUE) |> round(3),
  sd = sapply(df, sd, na.rm = TRUE) |> round(3),
  sd.normal = sapply(df, function(x) {
    x <- x[!is.na(x)]
    sd(x) / (max(x) - min(x))
  }) |> round(3)
) |>
  kable()
variable | mean | median | min | max | sd | sd.normal |
---|---|---|---|---|---|---|
Carb Volume | 5.370 | 5.347 | 5.040 | 5.700 | 0.106 | 0.161 |
Fill Ounces | 23.975 | 23.973 | 23.633 | 24.320 | 0.088 | 0.127 |
PC Volume | 0.277 | 0.271 | 0.079 | 0.478 | 0.061 | 0.152 |
Carb Pressure | 68.190 | 68.200 | 57.000 | 79.400 | 3.538 | 0.158 |
Carb Temp | 141.095 | 140.800 | 128.600 | 154.000 | 4.037 | 0.159 |
PSC | 0.085 | 0.076 | 0.002 | 0.270 | 0.049 | 0.184 |
PSC Fill | 0.195 | 0.180 | 0.000 | 0.620 | 0.118 | 0.190 |
PSC CO2 | 0.056 | 0.040 | 0.000 | 0.240 | 0.043 | 0.179 |
Mnf Flow | 24.569 | 65.200 | -100.200 | 229.400 | 119.481 | 0.363 |
Carb Pressure1 | 122.586 | 123.200 | 105.600 | 140.200 | 4.743 | 0.137 |
Fill Pressure | 47.922 | 46.400 | 34.600 | 60.400 | 3.178 | 0.123 |
Hyd Pressure1 | 12.438 | 11.400 | -0.800 | 58.000 | 12.433 | 0.211 |
Hyd Pressure2 | 20.961 | 28.600 | 0.000 | 59.400 | 16.386 | 0.276 |
Hyd Pressure3 | 20.458 | 27.600 | -1.200 | 50.000 | 15.976 | 0.312 |
Hyd Pressure4 | 96.289 | 96.000 | 52.000 | 142.000 | 13.123 | 0.146 |
Filler Level | 109.252 | 118.400 | 55.800 | 161.200 | 15.698 | 0.149 |
Filler Speed | 3687.199 | 3982.000 | 998.000 | 4030.000 | 770.820 | 0.254 |
Temperature | 65.968 | 65.600 | 63.600 | 76.200 | 1.383 | 0.110 |
Usage cont | 20.993 | 21.790 | 12.080 | 25.900 | 2.978 | 0.215 |
Carb Flow | 2468.354 | 3028.000 | 26.000 | 5104.000 | 1073.696 | 0.211 |
Density | 1.174 | 0.980 | 0.240 | 1.920 | 0.378 | 0.225 |
MFR | 704.049 | 724.000 | 31.400 | 868.600 | 73.898 | 0.088 |
Balling | 2.198 | 1.648 | -0.170 | 4.012 | 0.931 | 0.223 |
Pressure Vacuum | -5.216 | -5.400 | -6.600 | -3.600 | 0.570 | 0.190 |
PH | 8.546 | 8.540 | 7.880 | 9.360 | 0.173 | 0.117 |
Oxygen Filler | 0.047 | 0.033 | 0.002 | 0.400 | 0.047 | 0.117 |
Bowl Setpoint | 109.327 | 120.000 | 70.000 | 140.000 | 15.303 | 0.219 |
Pressure Setpoint | 47.615 | 46.000 | 44.000 | 52.000 | 2.039 | 0.255 |
Air Pressurer | 142.834 | 142.600 | 140.800 | 148.200 | 1.212 | 0.164 |
Alch Rel | 6.897 | 6.560 | 5.280 | 8.620 | 0.505 | 0.151 |
Carb Rel | 5.437 | 5.400 | 4.960 | 6.060 | 0.129 | 0.117 |
Balling Lvl | 2.050 | 1.480 | 0.000 | 3.660 | 0.870 | 0.238 |
Looking at the normalized standard deviation (sd.normal), we see that MFR is very low. On the other hand, Mnf Flow and Hyd Pressure3 have relatively high amounts of variance.
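As a complement to eyeballing sd.normal, caret’s nearZeroVar() can flag predictors with degenerate variance. A quick sketch (not part of our pipeline; it may return no names at all):

# Sketch: list any numeric predictors with near-zero variance.
nearZeroVar(select(train_data, where(is.numeric)), names = TRUE)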
Missing Values (NA)
Let’s quantify how much of our data is missing.
count_missing <- function(data) {
  data |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      everything(),
      names_to = "variable",
      values_to = "missing"
    ) |>
    filter(missing > 0) |>
    mutate(
      ratio = missing / nrow(data)
    ) |>
    arrange(desc(missing))
}
count_missing(train_data) |>
kable()
variable | missing | ratio |
---|---|---|
MFR | 212 | 0.0824582 |
Brand Code | 120 | 0.0466744 |
Filler Speed | 57 | 0.0221704 |
PC Volume | 39 | 0.0151692 |
PSC CO2 | 39 | 0.0151692 |
Fill Ounces | 38 | 0.0147802 |
PSC | 33 | 0.0128355 |
Carb Pressure1 | 32 | 0.0124465 |
Hyd Pressure4 | 30 | 0.0116686 |
Carb Pressure | 27 | 0.0105018 |
Carb Temp | 26 | 0.0101128 |
PSC Fill | 23 | 0.0089459 |
Fill Pressure | 22 | 0.0085570 |
Filler Level | 20 | 0.0077791 |
Hyd Pressure2 | 15 | 0.0058343 |
Hyd Pressure3 | 15 | 0.0058343 |
Temperature | 14 | 0.0054454 |
Oxygen Filler | 12 | 0.0046674 |
Pressure Setpoint | 12 | 0.0046674 |
Hyd Pressure1 | 11 | 0.0042785 |
Carb Volume | 10 | 0.0038895 |
Carb Rel | 10 | 0.0038895 |
Alch Rel | 9 | 0.0035006 |
Usage cont | 5 | 0.0019448 |
PH | 4 | 0.0015558 |
Mnf Flow | 2 | 0.0007779 |
Carb Flow | 2 | 0.0007779 |
Bowl Setpoint | 2 | 0.0007779 |
Density | 1 | 0.0003890 |
Balling | 1 | 0.0003890 |
Balling Lvl | 1 | 0.0003890 |
count_missing(test_data) |>
kable()
variable | missing | ratio |
---|---|---|
PH | 267 | 1.0000000 |
MFR | 31 | 0.1161049 |
Filler Speed | 10 | 0.0374532 |
Brand Code | 8 | 0.0299625 |
Fill Ounces | 6 | 0.0224719 |
PSC | 5 | 0.0187266 |
PSC CO2 | 5 | 0.0187266 |
PC Volume | 4 | 0.0149813 |
Carb Pressure1 | 4 | 0.0149813 |
Hyd Pressure4 | 4 | 0.0149813 |
PSC Fill | 3 | 0.0112360 |
Oxygen Filler | 3 | 0.0112360 |
Alch Rel | 3 | 0.0112360 |
Fill Pressure | 2 | 0.0074906 |
Filler Level | 2 | 0.0074906 |
Temperature | 2 | 0.0074906 |
Usage cont | 2 | 0.0074906 |
Pressure Setpoint | 2 | 0.0074906 |
Carb Rel | 2 | 0.0074906 |
Carb Volume | 1 | 0.0037453 |
Carb Temp | 1 | 0.0037453 |
Hyd Pressure2 | 1 | 0.0037453 |
Hyd Pressure3 | 1 | 0.0037453 |
Density | 1 | 0.0037453 |
Balling | 1 | 0.0037453 |
Pressure Vacuum | 1 | 0.0037453 |
Bowl Setpoint | 1 | 0.0037453 |
Air Pressurer | 1 | 0.0037453 |
As we can see, there are a fair number of missing values. PH’s four missing values may not be topping the charts; nonetheless, they may be the most problematic. The guidance for this scenario is to remove the rows where PH is null from the training data, which means dropping those rows from both the predictors and the response. We will explain more when we reach that step, prior to training our models.
Distribution
We are temporarily recoding Brand Code to be numeric so that we can pivot our dataset to long format. And since we already have a handle on missing values, we are filtering them out.
train_data |>
  mutate(
    `Brand Code` = recode(
      `Brand Code`,
      "A" = 1,
      "B" = 2,
      "C" = 3,
      "D" = 4
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "values"
  ) |>
  filter(!is.na(values)) |>
  ggplot(aes(x = values)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, ncol = 4, scales = "free") +
  labs(
    title = "Distribution of Data",
    x = "Values",
    y = "Count"
  )
We see a good amount of data with normal distributions. But we also see some variables that stand out (we quantify the skews in the sketch after this list):

- Balling, Balling Lvl, and Density are bimodal. To a lesser extent, so is Air Pressurer.
- Filler Speed is heavily skewed, with most values near the high end and a crop of suspicious values at the low end.
- Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 all have a very large deposit of 0 values. Similarly, Mnf Flow has a large deposit of negative values.
- Oxygen Filler is extremely right skewed.
- Bowl Setpoint and Pressure Setpoint look as if they are comprised of discrete values.
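To back these visual impressions with numbers, here is a skewness sketch using e1071 (already loaded; positive values indicate right skew, negative values left skew):

# Sketch: numeric skewness for each predictor, sorted.
sapply(select(train_data, where(is.numeric)), e1071::skewness, na.rm = TRUE) |>
  sort() |>
  round(2)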
One Hot Encoding
Before we continue exploring, we are going to convert our categorical variable (Brand Code) to multiple variables using one hot encoding. We are doing it now because we are about to examine correlation and we want to include Brand Code in that study. We will store the results in train_data_pp, which will ultimately be our preprocessed training data. We will apply the same transformation to test_data so that we can use it for prediction later.
model <- dummyVars(~ ., data = train_data)
train_data_pp <- predict(model, newdata = train_data) |>
  as.data.frame()
test_data_pp <- predict(model, newdata = test_data) |>
  as.data.frame()
Correlation
We are using a correlation matrix to help us better understand the relationships between predictor variables. We are leaving the target (PH) in the dataset so that we may learn which predictors it correlates with, and so that we don’t remove predictors with which it has strong relationships.
cor_matrix <- cor(train_data_pp, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix, order = "hclust", tl.cex = 0.7)
And let’s identify those that we may want to remove because they are highly correlated. We will list variables with \(\ge .85\) correlation.
cor_matrix |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  pivot_longer(
    -variable,
    names_to = "correlation",
    values_to = "value"
  ) |>
  mutate(
    value = round(value, 3),
    abs_value = abs(value)
  ) |>
  filter(
    variable != correlation & abs_value >= 0.85
  ) |>
  arrange(desc(abs_value)) |>
  kable()
variable | correlation | value | abs_value |
---|---|---|---|
Balling | Balling Lvl | 0.978 | 0.978 |
Balling Lvl | Balling | 0.978 | 0.978 |
Density | Balling | 0.955 | 0.955 |
Balling | Density | 0.955 | 0.955 |
Filler Level | Bowl Setpoint | 0.949 | 0.949 |
Bowl Setpoint | Filler Level | 0.949 | 0.949 |
Density | Balling Lvl | 0.948 | 0.948 |
Balling Lvl | Density | 0.948 | 0.948 |
Filler Speed | MFR | 0.930 | 0.930 |
MFR | Filler Speed | 0.930 | 0.930 |
Alch Rel | Balling Lvl | 0.926 | 0.926 |
Balling Lvl | Alch Rel | 0.926 | 0.926 |
Hyd Pressure2 | Hyd Pressure3 | 0.925 | 0.925 |
Hyd Pressure3 | Hyd Pressure2 | 0.925 | 0.925 |
Balling | Alch Rel | 0.925 | 0.925 |
Alch Rel | Balling | 0.925 | 0.925 |
Density | Alch Rel | 0.903 | 0.903 |
Alch Rel | Density | 0.903 | 0.903 |
Brand Code D | Alch Rel | 0.899 | 0.899 |
Alch Rel | Brand Code D | 0.899 | 0.899 |
Linear View
We saw some weird distributions in our data, and in our summary statistics we saw that some of our variables had a lot of variance and others had a lot of missing values. Now we are going to look at the data in a different way: as a line plot. Our data is not a time series, but we think it is safe to assume that the observations were made over time. So we will plot it with a line plot and see what patterns, if any, surface. We would like to emphasize that in no way are we suggesting this should be interpreted as a time series; we are simply viewing it through the lens of a line plot.
We will use pivot_longer() to reshape the data so that we can plot one variable stacked on top of another. We will also use row_number() to create an index for the x-axis, and facet_wrap() to create a separate plot for each variable.
train_data_pp |>
  mutate(
    index = row_number()
  ) |>
  relocate(
    index, .before = everything()
  ) |>
  pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "values"
  ) |>
  ggplot(aes(x = index, y = values)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 1, scales = "free") +
  labs(
    title = "Line Plot of Variables",
    x = "Index",
    y = "Values"
  )
We see several interesting patterns:

- Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 have a lot of consecutive observations with the same value. It is very likely that these are the values we were seeing in our histograms.
- Carb Flow’s pattern is consistent from 0 to ~2000, at which point it drops significantly.
- Carb Pressure1 exhibits what looks like two different modes of operation: 0 to ~1350 and then ~1350 to the end. The “modes” are separated by what appears to be a small gap.
- Oxygen Filler also appears to have two modes of operation.
- Usage cont may be the oddest variable of all. It exhibits a noisy pattern up until ~1600, at which point it jumps radically higher and stays there in a noisy pattern.
- PH’s trend line is flat, but it clearly has a daily cycle. Just kidding (though it does almost look like time series data).
PCA Study
We employed Principal Component Analysis (PCA) as a dimensionality reduction technique to better understand the structure of the predictor space and to address potential multicollinearity among variables. This will also give us our first glimpse at important predictors and related predictors, and we may recognize these patterns in our models, as well.
Given that the final modeling objective is to predict pH, PCA was applied strictly to the predictor variables (i.e., the input features), with the target variable excluded during the transformation phase.
Methodology
The PCA was conducted using the caret and tidyverse packages in R. The procedure included the following key steps:
- Standardization: Each feature was Yeo-Johnson transformed, centered, and scaled to unit variance so that PCA would not be biased toward features with larger scales.
# Exclude target
predictors <- train_data_pp |>
  select(-PH) |>
  select(where(is.numeric))
# Preprocess: transform and standardize
predictors_prep <- preProcess(predictors,
                              method = c("YeoJohnson", "center",
                                         "scale", "medianImpute"))

# Apply the preprocessing to the predictors (PCA itself comes next)
processed_data <- predict(predictors_prep, predictors)
- PCA Transformation: Principal components were extracted from the standardized predictor matrix.
# Perform PCA (the data is already centered and scaled above)
pca_result <- prcomp(processed_data, center = FALSE, scale. = FALSE)
- Component Retention: The number of components to retain was informed by a combination of cumulative variance explained and visual inspection via a scree plot.
# Extract the variance explained from the PCA model
var_explained <- summary(pca_result)$importance[2, ]

# Create the scree plot data frame. The PCs must be factors so they populate
# the plot in order.
scree_df <- data.frame(
  PC = factor(paste0("PC", 1:length(var_explained)),
              levels = paste0("PC", 1:length(var_explained))),
  Variance = var_explained
)

# Filter to simplify visualization
scree_df_small <- scree_df[1:15, ]
# Plot
ggplot(scree_df_small, aes(x = PC, y = Variance)) +
geom_smooth(aes(group = 1), color = "darkblue", linewidth = 1, se = FALSE) +
ggtitle("Scree Plot") +
xlab("Principal Components") +
ylab("Proportion of Variance Explained") +
theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Results and Interpretation
We can use the graph above as a heuristic for determining the number of Principal Components to keep. The proportion of variance explained starts to taper off at PC6, so we would consider the first five PCs the most important for modeling. Components beyond this contribute marginally to the variance. Furthermore, we can examine the cumulative variance for each PC we add.
cumulative_variance <- summary(pca_result)$importance[3, ]

scree_df$Cumulative_Variance <- cumulative_variance
# Print table
scree_df |>
  as_tibble() |>
  filter(row_number() < 16) |>
  kable()
PC | Variance | Cumulative_Variance |
---|---|---|
PC1 | 0.20922 | 0.20922 |
PC2 | 0.16138 | 0.37060 |
PC3 | 0.07006 | 0.44066 |
PC4 | 0.05272 | 0.49337 |
PC5 | 0.04738 | 0.54075 |
PC6 | 0.04389 | 0.58464 |
PC7 | 0.04048 | 0.62512 |
PC8 | 0.03662 | 0.66174 |
PC9 | 0.03474 | 0.69648 |
PC10 | 0.03118 | 0.72766 |
PC11 | 0.02893 | 0.75659 |
PC12 | 0.02553 | 0.78212 |
PC13 | 0.02519 | 0.80731 |
PC14 | 0.02249 | 0.82981 |
PC15 | 0.02216 | 0.85196 |
From this output, we observe that the first few components capture a substantial portion of the total variance; it takes 15 components to reach roughly 85% of the cumulative variance.
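As a small illustration (a sketch using the objects defined above), the component count could be picked programmatically rather than read off the table:

# Sketch: the first PC index at which cumulative variance reaches 85%.
which(cumulative_variance >= 0.85)[1]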
Loadings and Interpretability
The rotation matrix provides the loadings of each original variable on the principal components. Loadings close to ±1 indicate strong influence, while values near 0 indicate minimal contribution.
pca_result$rotation[, 1:5] |>
  as.data.frame() |>
  arrange(desc(abs(PC1)))
PC1 PC2 PC3 PC4
`Alch Rel` 0.351917110 0.018637006 0.015947644 -0.021509088
Balling 0.350242358 -0.038655553 0.010768884 -0.051510764
`Balling Lvl` 0.347587056 0.018036175 0.058019202 -0.060589500
Density 0.342556769 0.003647185 -0.002780885 -0.008220555
`Carb Rel` 0.327287394 0.040790856 0.033190392 -0.055873032
`Carb Volume` 0.312671225 -0.010849527 0.077849439 0.014418401
`Brand Code`D 0.309509885 0.024613795 -0.030268525 0.036554775
`Brand Code`B -0.281886397 -0.018590616 -0.081590948 0.205154812
`Hyd Pressure4` -0.216438550 -0.010538108 0.259170039 -0.190256529
`Carb Pressure` 0.185499135 0.002969656 0.036219512 0.321843440
`Brand Code`A 0.102964974 -0.011611054 0.125500991 -0.121825267
`Pressure Setpoint` -0.097974309 -0.253715282 0.097041709 0.017762706
Temperature -0.085593182 0.073031414 0.238809759 -0.191529539
`Brand Code`C -0.080867400 0.007252593 0.040045941 -0.239356201
`PC Volume` -0.070027461 0.044363458 -0.300849404 -0.245631862
`Fill Pressure` -0.055001281 -0.243216724 -0.006425722 0.067604714
`Air Pressurer` -0.041131334 0.025209215 0.098674449 0.289146623
`Hyd Pressure2` 0.034053272 -0.356389157 -0.234508890 -0.128558271
`Hyd Pressure3` 0.033357301 -0.371170916 -0.209235867 -0.097374423
`Oxygen Filler` -0.033055156 0.197534661 -0.049257586 -0.014325367
`Pressure Vacuum` -0.032736654 0.268309580 -0.008011089 0.199000523
`Fill Ounces` -0.031395082 0.003712749 0.119973827 0.148610711
PSC -0.027814800 -0.009774838 -0.058858968 -0.040354680
`Mnf Flow` 0.024485822 -0.370246451 0.074307585 -0.015327355
`PSC CO2` -0.024235975 -0.017887178 0.046661968 0.026701094
`Carb Temp` 0.023503629 0.009142586 -0.001680039 0.344901502
`Filler Speed` 0.023311543 -0.069554166 -0.401263372 0.258147929
`Usage cont` 0.020636458 -0.235304698 0.168029336 0.032977655
`Bowl Setpoint` 0.019544205 0.293735628 -0.180089270 -0.265432969
MFR 0.019144953 0.014987724 -0.105029956 0.157232046
`Hyd Pressure1` 0.015014705 -0.293051774 -0.294238612 -0.220849783
`Carb Flow` -0.012178379 0.078728205 -0.353299056 0.232580514
`Filler Level` 0.009614800 0.297688391 -0.140515050 -0.270964806
`PSC Fill` -0.008729149 0.020514009 0.031714068 -0.013912533
`Carb Pressure1` 0.006908870 -0.148667635 0.370498100 0.012035387
PC5
`Alch Rel` -0.0333321566
Balling 0.0012498256
`Balling Lvl` 0.0180168286
Density 0.0169572081
`Carb Rel` 0.0042927144
`Carb Volume` -0.1089099295
`Brand Code`D -0.0990504201
`Brand Code`B -0.1349954273
`Hyd Pressure4` 0.1670885311
`Carb Pressure` 0.4985690417
`Brand Code`A 0.1690738712
`Pressure Setpoint` 0.0888629414
Temperature 0.2261451642
`Brand Code`C 0.1686322556
`PC Volume` 0.2243499330
`Fill Pressure` 0.0701692254
`Air Pressurer` -0.0159749120
`Hyd Pressure2` 0.0620873076
`Hyd Pressure3` 0.0648982399
`Oxygen Filler` 0.0873573908
`Pressure Vacuum` -0.0831161164
`Fill Ounces` -0.0698995498
PSC -0.0053777417
`Mnf Flow` -0.0175941336
`PSC CO2` 0.0701332075
`Carb Temp` 0.6153834830
`Filler Speed` -0.1399971150
`Usage cont` -0.1300730429
`Bowl Setpoint` 0.0579414364
MFR -0.1818287321
`Hyd Pressure1` 0.1129535858
`Carb Flow` 0.0651455076
`Filler Level` 0.0526725782
`PSC Fill` 0.0004411326
`Carb Pressure1` -0.0783446680
By examining the loading structure, we can interpret PC1 as a linear combination emphasizing variables such as Alch Rel, Balling, and Density.
Components with clear thematic groupings (e.g., all chemistry variables, or all environmental sensors) enhance interpretability and suggest latent structures in the data.
Preprocessing
We have discovered much and have some work to do. Some preprocessing will be applied to our training and testing datasets and used for all regression models. Other preprocessing, such as centering, scaling, and PCA, will be applied more selectively per model. To recap what we found:
- We have a fair amount of correlation.
- We have NAs in our target variable.
- We have predictors with varying amounts of missing data.
- We have a categorical variable.
- And we have some data that is not NA, but looks as if it should be.
First, we have already dealt with our categorical data in train_data_pp and test_data_pp. We shall use them as a starting place for further preprocessing.
Next, we will deal with our missing PH data. We don’t want to train our models to predict NA, so we are going to drop the four rows missing PH values.
train_data_pp <- train_data_pp |>
  filter(!is.na(PH))
# test_data_pp is already void of PH values
And now we will remove highly correlated variables. We will use a correlation threshold of 0.95. While 0.75 is a common starting point, retaining more features helps maintain signal for models like PLS, which can internally handle correlated predictors by extracting latent factors.
high_corr <- train_data_pp |>
  select(-PH) |>
  cor(use = "pairwise.complete.obs") |>
  findCorrelation(cutoff = 0.95)

# print the names of the highly correlated variables
glue("Removing columns: {colnames(train_data_pp)[high_corr]}")
Removing columns: Balling
# remove the highly correlated variables
train_data_pp <- train_data_pp[, -high_corr]
test_data_pp <- test_data_pp[, -high_corr]
For missing data we will impute, but the suggested imputation methods vary by model, so we will leave that processing to be done at training time.
Our line plots revealed some data that is suspiciously constant over many consecutive observations: Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3. We suspect that these are effectively missing data. Our options are:

- Leave them as they are.
- Remove the variables altogether.
- Set the suspicious observations to NA and impute them.
After weighing the options, we chose the third: we recode the suspicious values as NA below and let imputation handle them at training time. We acknowledge the risk; we don’t know enough about the process to be certain that none of these are legitimate values, and the comment in the code notes where this assumption is flawed.
train_data_pp <- train_data_pp |>
  mutate(
    # convert the NAs encoded as numbers into proper NAs.
    `Mnf Flow` = if_else(`Mnf Flow` < 0, NA, `Mnf Flow`),
    # For rows with legitimate 0 values, this is flawed.
    `Hyd Pressure1` = if_else(`Hyd Pressure1` == 0, NA, `Hyd Pressure1`),
    `Hyd Pressure2` = if_else(`Hyd Pressure2` == 0, NA, `Hyd Pressure2`),
    `Hyd Pressure3` = if_else(`Hyd Pressure3` == 0, NA, `Hyd Pressure3`)
  )
Regression Workshop
This is our little workshop of utilities. You will see below in preprocess
that we don’t actually partition our data. Our dataset being somewhat small, we have decided to use cross validation to test the performance of our models. We think it is better because it uses different combinations of data to train and test our models. If we partition our training dataset then we are compromising our training and testing in one of two mutually exclusive ways. If we chose to partition it and not use cross validation, then we are training our model with only one snapshot of our data, which could lead to bias. If we chose to partition it and use cross validation, then we are training it with an even smaller dataset. We feel we have everything to gain by not partitioning the data and nothing to lose.
preprocess() is a simple utility that sets the random seed in the hopes that all models will use the same cross-validated sequence of data. It extracts y from data and applies the preprocessing methods to the predictors. It returns a list including the preprocessed data, the preProcess model, and the response variable.
preprocess <- function(data, methods) {
  set.seed(31415)
  # separate the data into predictors and target
  X <- dplyr::select(data, -PH)
  y <- data$PH

  # preprocess the predictors
  model <- preProcess(X, method = methods)
  X <- predict(model, X)
  list(
    PPM = model,
    X = X,
    y = y
  )
}
report() reports accuracy with MAE, RMSE, and R2. It also lists the most important variables in descending order.
report <- function(model) {
  mae <- mean(model$results$MAE)
  rmse <- mean(model$results$RMSE)
  r2 <- mean(model$results$Rsquared)
  glue("{model$method}: MAE={round(mae, 2)}, RMSE={round(rmse, 2)}, R2={round(r2, 2)}") |>
    print()
  glue("Best tuned parameters: {colnames(model$bestTune)} = {model$bestTune}") |>
    print()
  print(varImp(model))
}
ctrl is our directive to the training process to use cross validation to train and evaluate our models. Most models will use this control. Some models will use a less intense cross validation scheme; for example, Random Forest takes a long time to run with our default parameters, so we cut its cross validation down to 2 folds and 5 repeats.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
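As a side note, the reduced scheme mentioned above could be defined once and reused instead of repeating inline trainControl() calls. A sketch (ctrl_light is our own name and is not used elsewhere in this report):

# Sketch: a lighter control for the slow models (Random Forest, MARS, neural nets).
ctrl_light <- trainControl(method = "repeatedcv", number = 2, repeats = 5)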
Linear Regression
Refined Preprocessing
data <- preprocess(
  train_data_pp,
  methods = c("knnImpute", "YeoJohnson", "center", "scale")
)
Train Partial Least Squares (PLS) Model
pls_model <- train(
  x = data$X,
  y = data$y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
report(pls_model)
pls: MAE=0.11, RMSE=0.14, R2=0.38
Best tuned parameters: ncomp = 20
pls variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow`          100.00
`Brand Code`C        80.48
`Usage cont`         64.55
`Bowl Setpoint`      64.05
`Filler Level`       54.85
`Hyd Pressure3`      52.32
Temperature          49.47
`Pressure Setpoint`  49.16
`Hyd Pressure2`      46.01
`Brand Code`B        43.27
`Fill Pressure`      40.36
`Hyd Pressure1`      36.08
`Pressure Vacuum`    35.67
`Carb Pressure1`     33.63
`Carb Flow`          33.53
`Brand Code`D        30.95
`Brand Code`A        29.53
`Oxygen Filler`      29.38
`Hyd Pressure4`      23.45
`Carb Rel`           23.19
Train Ordinary Least Squares (OLS) Model
Ordinary Least Squares (OLS) is a benchmark linear modeling method. We have already preprocessed the data in a way that suits both PLS and OLS. To determine which of these two linear models is best, let’s fit an OLS model.
ols_model <- train(
  x = data$X,
  y = data$y,
  method = "lm",
  trControl = ctrl
)
report(ols_model)
lm: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: intercept = TRUE
lm variable importance
only 20 most important variables shown (out of 33)
Overall
`Mnf Flow`          100.000
`Carb Pressure1`     61.468
Temperature          38.469
`Usage cont`         37.312
`Bowl Setpoint`      34.030
`Brand Code`C        32.573
`Hyd Pressure3`      31.683
Density              29.763
`Pressure Setpoint`  26.424
`Oxygen Filler`      26.261
`Brand Code`A        18.418
`Alch Rel`           16.912
`Carb Flow`          13.997
`Filler Level`       12.102
`Fill Ounces`        11.678
`PSC CO2`            11.650
`Hyd Pressure2`      11.131
`Carb Rel`           10.193
`PC Volume`           9.572
`PSC Fill`            8.811
OLS is interpretable and useful as a baseline. It assumes linear relationships and independence. It may not perform as well as more complex models if predictors are collinear or relationships are nonlinear.
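Since part of OLS’s appeal is explainability, here is a quick sketch (not part of our pipeline) of how the fitted coefficients could be surfaced for stakeholders:

# Sketch: the largest-magnitude coefficients from the fitted OLS model.
# Predictors were centered and scaled, so magnitudes are roughly comparable.
coefs <- coef(ols_model$finalModel)
head(coefs[order(abs(coefs), decreasing = TRUE)], 10)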
Train Lasso Regression Model
To assess whether regularization improves performance, we also explore Lasso, Ridge, and Elastic Net. These approaches can reduce overfitting and handle correlated predictors more effectively than OLS. We compare them against the other linear models using the same repeated cross-validation framework.
lasso_model <- train(
  x = data$X,
  y = data$y,
  method = "lasso",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(lasso_model)
lasso: MAE=0.11, RMSE=0.14, R2=0.37
Best tuned parameters: fraction = 0.857894736842105
loess r-squared variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Bowl Setpoint` 59.594
`Filler Level` 51.518
`Usage cont` 49.516
`Pressure Setpoint` 46.216
`Carb Flow` 41.604
`Brand Code`C 39.243
`Pressure Vacuum` 26.382
`Hyd Pressure3` 26.289
`Hyd Pressure2` 22.913
`Fill Pressure` 20.184
`Brand Code`D 16.949
`Oxygen Filler` 14.012
`Carb Rel` 12.635
Temperature 11.965
`Hyd Pressure1` 11.618
`Alch Rel` 10.601
`Hyd Pressure4` 10.396
`Brand Code`A 4.905
MFR 4.545
Train Ridge Regression Model
ridge_model <- train(
  x = data$X,
  y = data$y,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(ridge_model)
ridge: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: lambda = 0.01
loess r-squared variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Bowl Setpoint` 59.594
`Filler Level` 51.518
`Usage cont` 49.516
`Pressure Setpoint` 46.216
`Carb Flow` 41.604
`Brand Code`C 39.243
`Pressure Vacuum` 26.382
`Hyd Pressure3` 26.289
`Hyd Pressure2` 22.913
`Fill Pressure` 20.184
`Brand Code`D 16.949
`Oxygen Filler` 14.012
`Carb Rel` 12.635
Temperature 11.965
`Hyd Pressure1` 11.618
`Alch Rel` 10.601
`Hyd Pressure4` 10.396
`Brand Code`A 4.905
MFR 4.545
Train Elastic Net Model
elastic_model <- train(
  x = data$X,
  y = data$y,
  method = "glmnet",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(elastic_model)
glmnet: MAE=0.11, RMSE=0.14, R2=NaN
Best tuned parameters: alpha = 0.952631578947368
Best tuned parameters: lambda = 0.000335359611608049
glmnet variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Hyd Pressure3` 62.568
`Brand Code`C 50.353
`Bowl Setpoint` 49.733
`Carb Pressure1` 37.185
Density 36.771
`Alch Rel` 32.715
`Usage cont` 24.122
Temperature 21.838
`Pressure Setpoint` 19.312
`Hyd Pressure2` 16.036
`Oxygen Filler` 15.373
`Brand Code`A 14.679
`Filler Level` 13.963
`Carb Rel` 10.869
`Carb Flow` 10.216
`Carb Volume` 9.254
`Fill Pressure` 7.366
`Fill Ounces` 6.815
`Hyd Pressure1` 6.218
We see a warning after training the Elastic Net model that there were missing values in the resampled performance measures. The model finished training and produced valid results, and, as we will see below, its metrics are close to the others’, so we will not recommend it as our final model unless it offers clear advantages (which, as we will see, it does not).
Comparison of Linear Models
After training and comparing five linear modeling approaches — OLS, PLS, Lasso, Ridge, and Elastic Net — we find that performance across all methods is remarkably similar. Each model was evaluated using repeated 10-fold cross-validation, and results were compared using RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error).
The best RMSE and best R-squared values were achieved by OLS and Ridge, but the differences between all models are minuscule (a few thousandths of an RMSE unit at most). The lowest MAE was from OLS at 0.104241, but again, the margin is extremely small. (A more rigorous per-fold comparison is sketched at the end of this subsection.)
Given the similarity in performance, we recommend Ordinary Least Squares (OLS) for this use case:
- It is simple, interpretable, and fast to compute.
- It slightly outperformed PLS and Lasso in both MAE and RMSE.
- Its coefficients can be used directly for business insight and explainability.
Elastic Net showed nearly identical performance but triggered a warning about missing resampled performance measures, suggesting mild instability. If regularization becomes important due to changing data conditions or overfitting concerns in future phases, Lasso or Elastic Net could be revisited.
This modeling exercise confirms that pH can be predicted with consistent accuracy using standard linear models. The choice of model does not significantly alter accuracy, allowing the business to prioritize simplicity and transparency. Implementing the OLS model enables stakeholders to:
- Understand which production variables most influence pH
- Monitor predictions in real time using a straightforward model
- Translate coefficients into actionable guidance for technicians
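For a more rigorous head-to-head than averaging over each tuning grid, caret’s resamples() collects per-fold metrics at each model’s best tune. A sketch using the models trained above (this assumes the resampling folds line up across models, which our shared seed aims to ensure):

# Sketch: compare per-fold performance of the linear models at their best tunes.
linear_resamples <- resamples(list(
  OLS = ols_model,
  PLS = pls_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model
))
summary(linear_resamples)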
Non Linear Regression
Refined Preprocessing
data <- preprocess(train_data_pp, c("center", "scale", "medianImpute", "pca"))
KNN Model
k-Nearest Neighbors is a non-parametric method used for classification and regression. It works by finding the k-nearest neighbors to a given point and using their values to predict the value of that point.
knn_model <- train(
  x = data$X,
  y = data$y,
  method = "knn",
  tuneLength = 20,
  trControl = ctrl
)
report(knn_model)
knn: MAE=0.1, RMSE=0.13, R2=0.45
Best tuned parameters: k = 9
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
SVM Model
Support Vector Machines (SVM) is a supervised learning algorithm that can be used for classification or regression. For classification it finds the hyperplane that best separates the classes; for regression it fits a function that keeps most points within a margin of tolerance. We will try three different SVM kernels: linear, polynomial, and radial.
Linear SVM Model
svml_model <- train(
  x = data$X,
  y = data$y,
  method = "svmLinear",
  tuneLength = 10,
  trControl = ctrl
)
report(svml_model)
svmLinear: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: C = 1
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
Polynomial SVM Model
svmp_model <- train(
  x = data$X,
  y = data$y,
  method = "svmPoly",
  tuneLength = 3,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(svmp_model)
svmPoly: MAE=0.11, RMSE=0.15, R2=0.34
Best tuned parameters: degree = 3
Best tuned parameters: scale = 0.01
Best tuned parameters: C = 1
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
Radial SVM Model
svmr_model <- train(
  x = data$X,
  y = data$y,
  method = "svmRadial",
  tuneLength = 5,
  trControl = ctrl
)
report(svmr_model)
svmRadial: MAE=0.09, RMSE=0.12, R2=0.49
Best tuned parameters: sigma = 0.0322972739788174
Best tuned parameters: C = 4
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
MARS Model
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that can be used for both linear and non-linear regression. It works by fitting piece-wise linear functions to the data.
We are using a less intensive cross validation method for MARS because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid <- expand.grid(
  degree = 1:2,
  nprune = 5:10
)
mars_model <- train(
  x = data$X,
  y = data$y,
  method = "earth",
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(mars_model)
earth: MAE=0.11, RMSE=0.14, R2=0.3
Best tuned parameters: nprune = 10
Best tuned parameters: degree = 2
earth variable importance
Overall
PC2 100.000
PC1 71.570
PC6 71.570
PC7 52.247
PC17 28.489
PC18 24.062
PC13 16.917
PC11 9.624
PC15 9.624
PC16 0.000
Neural Network Model
Neural Networks are a class of models that are inspired by the way the human brain works. They are used for both classification and regression tasks. In this case, we are using it for regression.
As with MARS, we are using a less intensive cross validation method for Neural Networks because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid <- expand.grid(
  size = c(2, 4, 6, 8, 10),
  decay = c(0, 0.05, 0.1, 0.15)
)
nn_model <- train(
  x = data$X,
  y = data$y,
  method = "nnet",
  linout = TRUE,
  trace = FALSE,
  maxit = 1000,
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(nn_model)
nnet: MAE=0.11, RMSE=0.17, R2=0.38
Best tuned parameters: size = 6
Best tuned parameters: decay = 0.1
nnet variable importance
Overall
PC20 100.00
PC6 97.17
PC2 93.23
PC19 82.67
PC7 82.65
PC16 80.56
PC17 68.44
PC4 68.33
PC1 58.24
PC18 54.43
PC10 44.01
PC13 34.40
PC15 33.81
PC8 32.10
PC5 30.84
PC3 28.67
PC11 28.28
PC14 18.44
PC12 10.78
PC9 0.00
Non Linear Summary
Of the non-linear models, the radial SVM performed best with an R-squared of 0.485, followed closely by KNN at 0.455. Somewhat surprisingly, the MARS model had the lowest R-squared of the nonlinear bunch at 0.302, while the Neural Network came in at 0.384.
The MAE and RMSE, for the most part, reinforce R-squared’s story. The nonlinear champion, the radial SVM, had the lowest MAE (0.093) and lowest RMSE (0.125), while at the other extreme MARS had the highest MAE (0.114) and one of the highest RMSEs (0.144).
Decision Trees
Refined Preprocessing
rpart_data <- preprocess(train_data_pp, "medianImpute")
# column names with backticks are causing issues with the caret package.
rpart_data$X <- clean_names(rpart_data$X)
Recursive Partitioning Model
rpart_model <- train(
  x = rpart_data$X,
  y = rpart_data$y,
  method = "rpart",
  tuneLength = 10,
  trControl = ctrl
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(rpart_model)
rpart: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: cp = 0.0136301352354267
rpart variable importance
only 20 most important variables shown (out of 34)
Overall
mnf_flow 100.000
temperature 88.758
usage_cont 80.417
filler_speed 76.020
balling_lvl 69.764
fill_pressure 51.009
bowl_setpoint 48.788
hyd_pressure3 48.601
filler_level 45.544
air_pressurer 45.273
pressure_vacuum 41.667
brand_code_c 40.155
carb_flow 36.453
pressure_setpoint 32.795
carb_pressure1 32.761
carb_rel 29.118
oxygen_filler 28.459
alch_rel 20.989
brand_code_d 18.941
carb_volume 7.517
Visualize the model
rpart.plot(
  rpart_model$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Decision Tree for pH Prediction"
)
The recursive partitioning model was built using the rpart method with 10 candidate complexity parameter (cp) values. The optimal cp found was 0.0136, balancing tree complexity and performance. While the decision tree offers good interpretability, its predictive power was moderate: an R-squared of 0.34 indicates only limited explanatory capability for the variance in the response variable. The visualization of the decision tree helps show how key variables influence the pH prediction.
The top three predictors identified were mnf_flow (100% importance), temperature (88.76), and usage_cont (80.42).
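Where interpretability matters, the fitted tree can also be rendered as plain-language rules, which can be easier to hand to process engineers than the plot. A sketch using rpart.plot’s rule printer (not part of our pipeline):

# Sketch: print the fitted tree as human-readable decision rules.
rpart.rules(rpart_model$finalModel, roundint = FALSE)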
Random Forest Model
rf_data <- rpart_data

# Random forest takes a very long time to run with our default parameters.
# We are cutting down on cross validation.
rf_model <- train(
  x = rf_data$X,
  y = rf_data$y,
  method = "rf",  # caret's default when no method is given; made explicit here
  importance = TRUE,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(rf_model)
rf: MAE=0.08, RMSE=0.11, R2=0.63
Best tuned parameters: mtry = 18
rf variable importance
only 20 most important variables shown (out of 34)
Overall
mnf_flow 100.00
brand_code_c 92.93
pressure_vacuum 88.75
oxygen_filler 77.26
balling_lvl 65.47
air_pressurer 61.98
temperature 59.73
alch_rel 58.12
carb_rel 58.11
usage_cont 54.06
density 52.87
hyd_pressure3 51.18
filler_speed 48.47
carb_flow 45.49
carb_pressure1 44.49
hyd_pressure1 43.02
bowl_setpoint 40.53
hyd_pressure2 38.17
fill_pressure 31.83
filler_level 30.07
Visualize the model
varImpPlot(
  rf_model$finalModel,
  type = 1,
  main = "Random Forest Variable Importance"
)
The Random Forest outperformed the decision tree, nearly doubling the explanatory power (Rsquared of 0.63). It provided much stronger predictive performance, suggesting it is better suited for this dataset despite longer training time and reduced interpretability.
Top predictors included mnf_flow (100% importance), consistent with the decision tree, followed by brand_code_c (92.93) and pressure_vacuum (88.75).
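To see how the champion’s top predictor drives predictions, here is a partial-dependence sketch using randomForest’s partialPlot() (the column name assumes the cleaned names used above; this can be slow on the full training set):

# Sketch: marginal effect of mnf_flow on predicted pH under the fitted forest.
partialPlot(
  rf_model$finalModel,
  pred.data = as.data.frame(rf_data$X),
  x.var = "mnf_flow",
  main = "Partial Dependence of pH on mnf_flow"
)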
Model Comparison
Having used cross-validation to build our models, each caret object now carries a results table of resampled performance statistics. Here we take those summary statistics and compare the models to one another.
model_list <- list(
  PLS = pls_model,
  OLS = ols_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model,
  KNN = knn_model,
  SVM_linear = svml_model,
  SVM_poly = svmp_model,
  SVM_radial = svmr_model,
  MARS = mars_model,
  NeuralNet = nn_model,
  RPart = rpart_model,
  RandomForest = rf_model
)

# extract the mean resampled metrics from each model
model_results <- tibble(
  Model = names(model_list),
  MAE = map(
    model_list, function(x) mean(x$results$MAE, na.rm = TRUE)
  ) |> as_vector(),
  RMSE = map(
    model_list, function(x) mean(x$results$RMSE, na.rm = TRUE)
  ) |> as_vector(),
  Rsquared = map(
    model_list, function(x) mean(x$results$Rsquared, na.rm = TRUE)
  ) |> as_vector()
)
model_results |>
  arrange(desc(Rsquared), MAE, RMSE) |>
  mutate(
    MAE = round(MAE, 3),
    RMSE = round(RMSE, 3),
    Rsquared = round(Rsquared, 3)
  ) |>
  kable()
Model | MAE | RMSE | Rsquared |
---|---|---|---|
RandomForest | 0.080 | 0.108 | 0.626 |
SVM_radial | 0.093 | 0.125 | 0.485 |
KNN | 0.099 | 0.128 | 0.455 |
OLS | 0.104 | 0.134 | 0.396 |
Ridge | 0.104 | 0.134 | 0.396 |
NeuralNet | 0.108 | 0.172 | 0.381 |
PLS | 0.106 | 0.136 | 0.380 |
ElasticNet | 0.109 | 0.139 | 0.370 |
Lasso | 0.108 | 0.138 | 0.370 |
RPart | 0.111 | 0.141 | 0.340 |
SVM_poly | 0.113 | 0.148 | 0.338 |
SVM_linear | 0.110 | 0.141 | 0.337 |
MARS | 0.114 | 0.144 | 0.302 |
The Champion
A champion, ideally, is a model with the lowest MAE, the lowest RMSE, and the highest R2. And we have such a champion. Please allow us to present the champion of our modeling competition: Random Forest. It is only moderately interpretable and not very fast to compute, but its accuracy is unmatched here.
Among all tested models, the Random Forest model did demonstrate the best overall performance, achieving the lowest MAE (0.080), lowest RMSE (0.108), and highest r squared (0.626), indicating superior predictive accuracy and explanatory power.
The SVM radial and KNN models followed, with moderate performance (r squared of 0.485 and 0.455, respectively), but did not match the precision of Random Forest. Linear models such as OLS, Ridge, and Elastic Net showed similar results, with r squared values clustered around 0.37–0.40.
The Neural Network, PLS, Lasso, and RPart (Decision Tree) models exhibited slightly weaker performance, with r squared values below 0.40. The MARS model had the lowest performance overall (r squared of 0.302).
In conclusion, Random Forest is the recommended model for this dataset due to its balance of accuracy and robustness, despite being more complex and computationally intensive.
Final Predictions
Finally, we will make predictions using our champion model, Random Forest. We will use the test data to make predictions and save them to a CSV file.
# preprocess the test data with the same model we used to preprocess the training data
df <- predict(rf_data$PPM, test_data_pp)
# the variable names in our training set caused problems with `randomForest`, so we
# "cleaned" them. The model is expecting the same variables.
df <- clean_names(df)
data.frame(
  SampleID = 1:nrow(df),
  Predicted_pH = predict(rf_model, newdata = df)
) |>
  write.csv(
    "Final_Predictions.csv",
    row.names = FALSE
  )
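As a final sanity check (a sketch, not part of the deliverable), the saved predictions should fall within the pH range observed in training (roughly 7.88 to 9.36):

# Sketch: confirm the predicted pH values sit in a plausible range.
summary(read.csv("Final_Predictions.csv")$Predicted_pH)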