ABC Beverage is conducting an internal review of its manufacturing process with a focus on understanding the factors that influence product pH levels. This technical report highlights our team’s data science-driven approach to modeling and predicting pH using historical production data.
To achieve this, we analyzed the dataset provided by ABC Beverage, performed data cleaning and exploratory analysis, and evaluated several models, including linear and nonlinear regressions, decision trees, and random forests. Throughout this report we outline our methodology, modeling process, evaluation strategy, and the final model selected for predicting pH levels.
library(corrplot)
library(caret)
library(earth)
library(e1071)
library(Formula)
library(GGally)
library(ggpubr)
library(glmnet)
library(glue)
library(janitor)
library(knitr)
library(Metrics)
library(plotmo)
library(plotrix)
library(pls)
library(RANN)
library(randomForest)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(tidyverse)
First we must load the data. We have been provided two datasets. We presumed that one was designated for training (“StudentData.xlsx”) and the other for testing (“StudentEvaluation.xlsx”). However, we found that “StudentEvaluation.xlsx” does not contain target variable values. So, we concluded that “StudentData.xlsx” is for both training and testing, and that “StudentEvaluation.xlsx” is for our final predictions.
train_data <- read_excel("StudentData.xlsx")
test_data <- read_excel("StudentEvaluation.xlsx")
George Hagstrom, our professor of DATA 607, defined exploratory data analysis as “…the art of looking at data in a systematic way in order to understand the underlying structure of the data. It has two main goals: ensure data quality and uncover patterns to guide future analysis. EDA is detective work: you ask and answer questions, which inspires more questions.” And that is exactly what we will do. We are going to get to know our datasets by summarizing them, scrutinizing them, looking at their correlation properties, and visualizing them.
We will take a look at a sample of the data: its columns and types, the number of rows and columns, and the first few rows. Finally, we will calculate summary statistics (the mean, median, min, max, and standard deviation of each column) so that we may better understand the nature of our data.
head(train_data)
## # A tibble: 6 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(train_data)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
## $ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
## $ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
## $ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num [1:2571] 141 140 145 133 137 ...
## $ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
## $ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num [1:2571] 121 119 120 118 119 ...
## $ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
## $ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
## $ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num [1:2571] 725 727 735 731 723 ...
## $ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
## $ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
We can see there are 33 columns and 2,571 rows in total. The 33 columns include the “PH” column, which is what we will be aiming to predict: PH will be the response variable in our regression exploration.
We can see that we have mostly numeric values. The only character or categorical (non-numeric) variable is `Brand Code`. Based on information in our team's preferred guidance on predictive modeling, by Max Kuhn and Kjell Johnson, the recommendation for non-numeric (categorical) values is to either convert them into dummy variables or remove them if they are not informative or have too many categories. We have found that `Brand Code` has significant importance in some of our models, so we will convert it to dummy variables before training (since models such as Lasso require numeric input predictors).
We can also see that there are negative values, which means we won't be able to apply the Box-Cox method when preprocessing our data. Our guidance, from Applied Predictive Modeling, states that if the data includes negative values we should use the Yeo-Johnson method instead.
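As a quick illustration (a minimal sketch using caret's preProcess; the toy column x is hypothetical, not from our dataset), Box-Cox is skipped for non-positive data while Yeo-Johnson transforms it:
x <- data.frame(x = c(-100, -50, 0, 25, 230))
# Box-Cox requires strictly positive values, so caret quietly skips this column
preProcess(x, method = "BoxCox")$method$BoxCox
# Yeo-Johnson accommodates zeros and negatives
yj <- preProcess(x, method = "YeoJohnson")
predict(yj, x)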
We will also take a look at the summary statistics of our data. We would normally use the `summary()` function, but it prints wide rather than long and does not present well. So, we are going to take a stick-shift approach.
df <- select(train_data, -`Brand Code`)
data.frame(
mean = sapply(df, mean, na.rm = TRUE) |> round(3),
median = sapply(df, median, na.rm = TRUE) |> round(3),
min = sapply(df, min, na.rm = TRUE) |> round(3),
max = sapply(df, max, na.rm = TRUE) |> round(3),
sd = sapply(df, sd, na.rm = TRUE) |> round(3),
# standard deviation normalized by each variable's range, so that
# spreads are comparable across columns with different scales
sd.normal = sapply(df, function(x) {
x = x[!is.na(x)]
sd(x) / (max(x) - min(x))
}) |> round(3)
) |>
kable()
variable | mean | median | min | max | sd | sd.normal |
---|---|---|---|---|---|---|
Carb Volume | 5.370 | 5.347 | 5.040 | 5.700 | 0.106 | 0.161 |
Fill Ounces | 23.975 | 23.973 | 23.633 | 24.320 | 0.088 | 0.127 |
PC Volume | 0.277 | 0.271 | 0.079 | 0.478 | 0.061 | 0.152 |
Carb Pressure | 68.190 | 68.200 | 57.000 | 79.400 | 3.538 | 0.158 |
Carb Temp | 141.095 | 140.800 | 128.600 | 154.000 | 4.037 | 0.159 |
PSC | 0.085 | 0.076 | 0.002 | 0.270 | 0.049 | 0.184 |
PSC Fill | 0.195 | 0.180 | 0.000 | 0.620 | 0.118 | 0.190 |
PSC CO2 | 0.056 | 0.040 | 0.000 | 0.240 | 0.043 | 0.179 |
Mnf Flow | 24.569 | 65.200 | -100.200 | 229.400 | 119.481 | 0.363 |
Carb Pressure1 | 122.586 | 123.200 | 105.600 | 140.200 | 4.743 | 0.137 |
Fill Pressure | 47.922 | 46.400 | 34.600 | 60.400 | 3.178 | 0.123 |
Hyd Pressure1 | 12.438 | 11.400 | -0.800 | 58.000 | 12.433 | 0.211 |
Hyd Pressure2 | 20.961 | 28.600 | 0.000 | 59.400 | 16.386 | 0.276 |
Hyd Pressure3 | 20.458 | 27.600 | -1.200 | 50.000 | 15.976 | 0.312 |
Hyd Pressure4 | 96.289 | 96.000 | 52.000 | 142.000 | 13.123 | 0.146 |
Filler Level | 109.252 | 118.400 | 55.800 | 161.200 | 15.698 | 0.149 |
Filler Speed | 3687.199 | 3982.000 | 998.000 | 4030.000 | 770.820 | 0.254 |
Temperature | 65.968 | 65.600 | 63.600 | 76.200 | 1.383 | 0.110 |
Usage cont | 20.993 | 21.790 | 12.080 | 25.900 | 2.978 | 0.215 |
Carb Flow | 2468.354 | 3028.000 | 26.000 | 5104.000 | 1073.696 | 0.211 |
Density | 1.174 | 0.980 | 0.240 | 1.920 | 0.378 | 0.225 |
MFR | 704.049 | 724.000 | 31.400 | 868.600 | 73.898 | 0.088 |
Balling | 2.198 | 1.648 | -0.170 | 4.012 | 0.931 | 0.223 |
Pressure Vacuum | -5.216 | -5.400 | -6.600 | -3.600 | 0.570 | 0.190 |
PH | 8.546 | 8.540 | 7.880 | 9.360 | 0.173 | 0.117 |
Oxygen Filler | 0.047 | 0.033 | 0.002 | 0.400 | 0.047 | 0.117 |
Bowl Setpoint | 109.327 | 120.000 | 70.000 | 140.000 | 15.303 | 0.219 |
Pressure Setpoint | 47.615 | 46.000 | 44.000 | 52.000 | 2.039 | 0.255 |
Air Pressurer | 142.834 | 142.600 | 140.800 | 148.200 | 1.212 | 0.164 |
Alch Rel | 6.897 | 6.560 | 5.280 | 8.620 | 0.505 | 0.151 |
Carb Rel | 5.437 | 5.400 | 4.960 | 6.060 | 0.129 | 0.117 |
Balling Lvl | 2.050 | 1.480 | 0.000 | 3.660 | 0.870 | 0.238 |
Looking at the range-normalized standard deviation (`sd.normal`), we see that `MFR`'s is very low. On the other hand, `Mnf Flow` and `Hyd Pressure3` have relatively high amounts of variance.

Let's quantify how much data is missing from our datasets.
count_missing <- function(data) {
data |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(
everything(),
names_to = "variable",
values_to = "missing"
) |>
filter(missing > 0) |>
mutate(
ratio = missing / nrow(data)
) |>
arrange(desc(missing))
}
count_missing(train_data) |>
kable()
variable | missing | ratio |
---|---|---|
MFR | 212 | 0.0824582 |
Brand Code | 120 | 0.0466744 |
Filler Speed | 57 | 0.0221704 |
PC Volume | 39 | 0.0151692 |
PSC CO2 | 39 | 0.0151692 |
Fill Ounces | 38 | 0.0147802 |
PSC | 33 | 0.0128355 |
Carb Pressure1 | 32 | 0.0124465 |
Hyd Pressure4 | 30 | 0.0116686 |
Carb Pressure | 27 | 0.0105018 |
Carb Temp | 26 | 0.0101128 |
PSC Fill | 23 | 0.0089459 |
Fill Pressure | 22 | 0.0085570 |
Filler Level | 20 | 0.0077791 |
Hyd Pressure2 | 15 | 0.0058343 |
Hyd Pressure3 | 15 | 0.0058343 |
Temperature | 14 | 0.0054454 |
Oxygen Filler | 12 | 0.0046674 |
Pressure Setpoint | 12 | 0.0046674 |
Hyd Pressure1 | 11 | 0.0042785 |
Carb Volume | 10 | 0.0038895 |
Carb Rel | 10 | 0.0038895 |
Alch Rel | 9 | 0.0035006 |
Usage cont | 5 | 0.0019448 |
PH | 4 | 0.0015558 |
Mnf Flow | 2 | 0.0007779 |
Carb Flow | 2 | 0.0007779 |
Bowl Setpoint | 2 | 0.0007779 |
Density | 1 | 0.0003890 |
Balling | 1 | 0.0003890 |
Balling Lvl | 1 | 0.0003890 |
count_missing(test_data) |>
kable()
variable | missing | ratio |
---|---|---|
PH | 267 | 1.0000000 |
MFR | 31 | 0.1161049 |
Filler Speed | 10 | 0.0374532 |
Brand Code | 8 | 0.0299625 |
Fill Ounces | 6 | 0.0224719 |
PSC | 5 | 0.0187266 |
PSC CO2 | 5 | 0.0187266 |
PC Volume | 4 | 0.0149813 |
Carb Pressure1 | 4 | 0.0149813 |
Hyd Pressure4 | 4 | 0.0149813 |
PSC Fill | 3 | 0.0112360 |
Oxygen Filler | 3 | 0.0112360 |
Alch Rel | 3 | 0.0112360 |
Fill Pressure | 2 | 0.0074906 |
Filler Level | 2 | 0.0074906 |
Temperature | 2 | 0.0074906 |
Usage cont | 2 | 0.0074906 |
Pressure Setpoint | 2 | 0.0074906 |
Carb Rel | 2 | 0.0074906 |
Carb Volume | 1 | 0.0037453 |
Carb Temp | 1 | 0.0037453 |
Hyd Pressure2 | 1 | 0.0037453 |
Hyd Pressure3 | 1 | 0.0037453 |
Density | 1 | 0.0037453 |
Balling | 1 | 0.0037453 |
Pressure Vacuum | 1 | 0.0037453 |
Bowl Setpoint | 1 | 0.0037453 |
Air Pressurer | 1 | 0.0037453 |
As we can see, there are a fair number of missing values. PH's four missing values may not be topping the charts; nonetheless, they may be the most problematic. The guidance for this scenario is to remove the rows with missing PH from the training data, which means dropping those rows from both the predictors and the response. We will explain more on this when we get to that step, prior to training our models.
We are temporarily recoding `Brand Code` to be numeric so that we can pivot our dataset to be long. And we have a handle on missing values, so we are filtering them out.
train_data |>
mutate(
`Brand Code` = recode(
`Brand Code`,
"A" = 1,
"B" = 2,
"C" = 3,
"D" = 4
)
) |>
pivot_longer(
cols = everything(),
names_to = "variable",
values_to = "values"
) |>
filter(!is.na(values)) |>
ggplot(aes(x = values)) +
geom_histogram(bins = 20) +
facet_wrap(~ variable, ncol = 4, scales = "free") +
labs(
title = "Distribution of Data",
x = "Values",
y = "Count"
)
We see a good amount of data with normal distributions. But we also see some variables that stand out:

- `Balling`, `Balling Lvl`, and `Density` are bimodal. To a lesser extent, so is `Air Pressurer`.
- `Filler Speed` is heavily left skewed, with a crop of suspicious values at the low end.
- `Hyd Pressure1`, `Hyd Pressure2`, and `Hyd Pressure3` all have a very large deposit of 0 values. Similarly, `Mnf Flow` has a large deposit of negative values.
- `Oxygen Filler` is extremely right skewed.
- `Bowl Setpoint` and `Pressure Setpoint` look as if they are comprised of discrete values.

Before we continue exploring, we are going to convert our categorical variable (`Brand Code`) to multiple variables using one-hot encoding. We are doing this now because we are about to examine correlation and we want to include `Brand Code` in that study. We will store the results in `train_data_pp`, which will ultimately be our preprocessed training data. We will apply the same transformation to `test_data` so that we can use it for prediction later.
model <- dummyVars(~ ., data = train_data)
train_data_pp <- predict(model, newdata = train_data) |>
as.data.frame()
test_data_pp <- predict(model, newdata = test_data) |>
as.data.frame()
We are using a correlation matrix to help us better understand the relationships between predictor variables. We are leaving the target (`PH`) in the dataset so that we may learn which predictors it correlates with, and so that we don't remove predictors with which it has strong relationships.
cor_matrix <- cor(train_data_pp, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix, order = "hclust", tl.cex = 0.7)
And let's identify those that we may want to remove because they are highly correlated. We will list variable pairs with correlation \(\ge 0.85\).
cor_matrix |>
as.data.frame() |>
rownames_to_column("variable") |>
pivot_longer(
-variable,
names_to = "correlation",
values_to = "value"
) |>
mutate(
value = round(value, 3),
abs_value = abs(value)
) |>
filter(
variable != correlation & abs_value >= 0.85
) |>
arrange(desc(abs_value)) |>
kable()
variable | correlation | value | abs_value |
---|---|---|---|
Balling | Balling Lvl | 0.978 | 0.978 |
Balling Lvl | Balling | 0.978 | 0.978 |
Density | Balling | 0.955 | 0.955 |
Balling | Density | 0.955 | 0.955 |
Filler Level | Bowl Setpoint | 0.949 | 0.949 |
Bowl Setpoint | Filler Level | 0.949 | 0.949 |
Density | Balling Lvl | 0.948 | 0.948 |
Balling Lvl | Density | 0.948 | 0.948 |
Filler Speed | MFR | 0.930 | 0.930 |
MFR | Filler Speed | 0.930 | 0.930 |
Alch Rel | Balling Lvl | 0.926 | 0.926 |
Balling Lvl | Alch Rel | 0.926 | 0.926 |
Hyd Pressure2 | Hyd Pressure3 | 0.925 | 0.925 |
Hyd Pressure3 | Hyd Pressure2 | 0.925 | 0.925 |
Balling | Alch Rel | 0.925 | 0.925 |
Alch Rel | Balling | 0.925 | 0.925 |
Density | Alch Rel | 0.903 | 0.903 |
Alch Rel | Density | 0.903 | 0.903 |
Brand Code D | Alch Rel | 0.899 | 0.899 |
Alch Rel | Brand Code D | 0.899 | 0.899 |
We saw some weird distributions in our data. And in our summary statistics we saw that some of our variables had a lot of variance. We also saw that some of our variables had a lot of missing values. We are going to take a look at the data in a different way. We are going to plot it as a line plot. But our data is not a time series. Nonetheless, we think that it’s safe to assume that the observations were made over time. So, we will try plotting it with a line plot and see what patterns, if any, surface. We would like to emphasize that in no way are we suggesting that this should be interpreted as a time series. We are simply viewing it through the lens of a line plot.
We will use `pivot_longer()` to reshape the data so that we can plot it one variable stacked on top of another. We will also use `row_number()` to create an index for the x-axis. And we will use `facet_wrap()` to create a separate plot for each variable.
train_data_pp |>
mutate(
index = row_number()
) |>
relocate(
index,
.before = everything()
) |>
pivot_longer(
cols = -1,
names_to = "variable",
values_to = "values"
) |>
ggplot(aes(x = index, y = values)) +
geom_line() +
facet_wrap(~ variable, ncol = 1, scales = "free") +
labs(
title = "Line Plot of Variables",
x = "Index",
y = "Values"
)
We see several interesting patterns:

- `Mnf Flow`, `Hyd Pressure1`, `Hyd Pressure2`, and `Hyd Pressure3` have many consecutive observations with the same value. These are very likely the values we were seeing in our histograms.
- `Carb Flow`'s pattern is consistent from 0 to 2000, at which point it drops significantly.
- `Carb Pressure1` exhibits what looks like two different modes of operation: 0 to ~1350 and then ~1350 to the end. The "modes" are separated by what appears to be a small gap.
- `Oxygen Filler` also appears to have two modes of operation.
- `Usage cont` may be the oddest variable of them all. It exhibits a noisy pattern up until ~1600, at which point it jumps to a high level and stays there in a noisy pattern.
- `PH`'s trend line is flat, but it clearly has a daily cycle. Just kidding (though it does almost look like time series data).

We employed Principal Component Analysis (PCA) as a dimensionality reduction technique to better understand the structure of the predictor space and to address potential multicollinearity among variables. This also gives us our first glimpse at important predictors and related predictors, and we may recognize these patterns in our models as well.
Given that the final modeling objective is to predict pH, PCA was applied strictly to the predictor variables (i.e., the input features), with the target variable excluded during the transformation phase.
The PCA was conducted using the caret and tidyverse packages in R. The procedure included the following key steps:
# Exclude target
predictors <- train_data_pp |>
select(-PH) |>
select(where(is.numeric))
# Preprocess: transform and standardize
predictors_prep <- preProcess(predictors,
method = c("YeoJohnson", "center",
"scale", "medianImpute"))
# Transform data using PCA
processed_data <- predict(predictors_prep, predictors)
# Perform PCA
pca_result <- prcomp(processed_data, center = FALSE, scale. = FALSE)
# Extract the variance explained from the PCA model
var_explained <- summary(pca_result)$importance[2,]
# Create the scree plot data frame. Need to factor PCs so they populate the
# plot correctly
scree_df <- data.frame(
PC = factor(paste0("PC", 1:length(var_explained)), levels = paste0("PC", 1:length(var_explained))),
Variance = var_explained
)
# Filter to simplify visualization
scree_df_small <- scree_df[1:15, ]
# Plot
ggplot(scree_df_small, aes(x = PC, y = Variance)) +
geom_smooth(aes(group = 1), color = "darkblue", linewidth = 1, se = FALSE) +
ggtitle("Scree Plot") +
xlab("Principal Components") +
ylab("Proportion of Variance Explained") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We can use the graph above as a heuristic for determining the number of Principal Components to keep. The proportion of variance explained starts to taper off at PC6, so we would consider the first five PCs as the most important for modeling. Components beyond this contribute marginally to the variance. Furthermore, we can examine the cumulative variance for each PC we add.
cumulative_variance <- summary(pca_result)$importance[3,]
scree_df$Cumulative_Variance <- cumulative_variance
# Print table
scree_df |>
as_tibble() |>
filter(row_number() < 16) |>
kable()
PC | Variance | Cumulative_Variance |
---|---|---|
PC1 | 0.20922 | 0.20922 |
PC2 | 0.16138 | 0.37060 |
PC3 | 0.07006 | 0.44066 |
PC4 | 0.05272 | 0.49337 |
PC5 | 0.04738 | 0.54075 |
PC6 | 0.04389 | 0.58464 |
PC7 | 0.04048 | 0.62512 |
PC8 | 0.03662 | 0.66174 |
PC9 | 0.03474 | 0.69648 |
PC10 | 0.03118 | 0.72766 |
PC11 | 0.02893 | 0.75659 |
PC12 | 0.02553 | 0.78212 |
PC13 | 0.02519 | 0.80731 |
PC14 | 0.02249 | 0.82981 |
PC15 | 0.02216 | 0.85196 |
From this output, we observe that the first few components capture a substantial portion of the total variance; the first 15 components together explain roughly 85% of it.
The rotation matrix provides the loadings of each original variable on the principal components. Loadings close to ±1 indicate strong influence, while values near 0 indicate minimal contribution.
pca_result$rotation[,1:5] |>
as.data.frame() |>
arrange(desc(abs(PC1)))
## PC1 PC2 PC3 PC4
## `Alch Rel` 0.351917110 0.018637006 0.015947645 -0.021509091
## Balling 0.350242357 -0.038655554 0.010768884 -0.051510765
## `Balling Lvl` 0.347587056 0.018036174 0.058019203 -0.060589501
## Density 0.342556769 0.003647184 -0.002780885 -0.008220558
## `Carb Rel` 0.327287394 0.040790855 0.033190393 -0.055873034
## `Carb Volume` 0.312671225 -0.010849528 0.077849440 0.014418396
## `Brand Code`D 0.309509885 0.024613794 -0.030268525 0.036554770
## `Brand Code`B -0.281886397 -0.018590616 -0.081590948 0.205154803
## `Hyd Pressure4` -0.216438550 -0.010538108 0.259170039 -0.190256519
## `Carb Pressure` 0.185499136 0.002969658 0.036219506 0.321843460
## `Brand Code`A 0.102964974 -0.011611054 0.125500991 -0.121825258
## `Pressure Setpoint` -0.097974310 -0.253715282 0.097041709 0.017762720
## Temperature -0.085593182 0.073031414 0.238809759 -0.191529526
## `Brand Code`C -0.080867400 0.007252593 0.040045941 -0.239356190
## `PC Volume` -0.070027461 0.044363458 -0.300849404 -0.245631858
## `Fill Pressure` -0.055001282 -0.243216724 -0.006425723 0.067604723
## `Air Pressurer` -0.041131334 0.025209215 0.098674448 0.289146621
## `Hyd Pressure2` 0.034053272 -0.356389157 -0.234508889 -0.128558268
## `Hyd Pressure3` 0.033357301 -0.371170916 -0.209235867 -0.097374419
## `Oxygen Filler` -0.033055156 0.197534661 -0.049257586 -0.014325367
## `Pressure Vacuum` -0.032736654 0.268309580 -0.008011089 0.199000520
## `Fill Ounces` -0.031395082 0.003712749 0.119973827 0.148610705
## PSC -0.027814800 -0.009774838 -0.058858968 -0.040354692
## `Mnf Flow` 0.024485822 -0.370246452 0.074307585 -0.015327357
## `PSC CO2` -0.024235975 -0.017887178 0.046661968 0.026701097
## `Carb Temp` 0.023503641 0.009142603 -0.001680053 0.344901533
## `Filler Speed` 0.023311543 -0.069554166 -0.401263373 0.258147919
## `Usage cont` 0.020636458 -0.235304698 0.168029336 0.032977648
## `Bowl Setpoint` 0.019544205 0.293735628 -0.180089269 -0.265432969
## MFR 0.019144953 0.014987724 -0.105029957 0.157232038
## `Hyd Pressure1` 0.015014704 -0.293051774 -0.294238611 -0.220849777
## `Carb Flow` -0.012178379 0.078728206 -0.353299056 0.232580515
## `Filler Level` 0.009614800 0.297688391 -0.140515049 -0.270964806
## `PSC Fill` -0.008729149 0.020514009 0.031714068 -0.013912536
## `Carb Pressure1` 0.006908869 -0.148667635 0.370498100 0.012035389
## PC5
## `Alch Rel` -0.0333321552
## Balling 0.0012498277
## `Balling Lvl` 0.0180168311
## Density 0.0169572074
## `Carb Rel` 0.0042927136
## `Carb Volume` -0.1089099266
## `Brand Code`D -0.0990504254
## `Brand Code`B -0.1349954459
## `Hyd Pressure4` 0.1670885478
## `Carb Pressure` 0.4985690147
## `Brand Code`A 0.1690738919
## `Pressure Setpoint` 0.0888629693
## Temperature 0.2261451810
## `Brand Code`C 0.1686322704
## `PC Volume` 0.2243499384
## `Fill Pressure` 0.0701692476
## `Air Pressurer` -0.0159749133
## `Hyd Pressure2` 0.0620873110
## `Hyd Pressure3` 0.0648982453
## `Oxygen Filler` 0.0873573939
## `Pressure Vacuum` -0.0831161172
## `Fill Ounces` -0.0698995664
## PSC -0.0053777634
## `Mnf Flow` -0.0175941414
## `PSC CO2` 0.0701332070
## `Carb Temp` 0.6153834505
## `Filler Speed` -0.1399971362
## `Usage cont` -0.1300730605
## `Bowl Setpoint` 0.0579414336
## MFR -0.1818287529
## `Hyd Pressure1` 0.1129535985
## `Carb Flow` 0.0651455166
## `Filler Level` 0.0526725738
## `PSC Fill` 0.0004411254
## `Carb Pressure1` -0.0783446670
By examining the loading structure, we can interpret PC1 as a linear combination emphasizing variables such as `Alch Rel`, `Balling`, and `Density`. Components with clear thematic groupings (e.g., all chemistry variables, or all environmental sensors) enhance interpretability and suggest latent structures in the data.
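To make such groupings easier to scan, here is a small sketch (our own addition, not part of the original pipeline) that lists the five largest loadings, by absolute value, for each of the first five components:
pca_result$rotation[, 1:5] |>
as.data.frame() |>
rownames_to_column("variable") |>
pivot_longer(-variable, names_to = "PC", values_to = "loading") |>
group_by(PC) |>
slice_max(abs(loading), n = 5) |>
arrange(PC, desc(abs(loading)))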
We have discovered much and have some work to do. Some preprocessing we will apply to our training and testing datasets for all regression models. Other preprocessing, such as centering, scaling, and PCA, will be applied more selectively per model. The tasks ahead:

- Remove the `NA`s in our target variable.
- Decide what to do with data that is not `NA` but looks as if it should be.

First, we have already dealt with our categorical data in `train_data_pp` and `test_data_pp`. We shall use them as a starting place for further preprocessing.
We will deal with our missing `PH` data. We don't want to train our models to predict `NA`, so we are going to drop the four rows missing `PH` values.
train_data_pp <- train_data_pp |>
filter(!is.na(`PH`))
# test_data_pp is already void of PH values
And now we will remove highly correlated variables. We will use a correlation threshold of 0.95. While 0.75 is a common starting point, retaining more features helps maintain signal for models like PLS, which can internally handle correlated predictors by extracting latent factors.
high_corr <- train_data_pp |>
select(-PH) |>
cor(use = "pairwise.complete.obs") |>
findCorrelation(cutoff = 0.95, names = TRUE)
# print the names of the highly correlated variables
glue("Removing columns: {high_corr}")
## Removing columns: Balling
# remove the highly correlated variables by name (the indices were computed
# without PH, so selecting by name avoids an off-by-one risk)
train_data_pp <- select(train_data_pp, -all_of(high_corr))
test_data_pp <- select(test_data_pp, -all_of(high_corr))
For missing data we will impute, but the suggested imputation methods vary by model, so we are going to leave that processing to be done at training time.
Our line plots revealed some data that is suspiciously constant over many consecutive observations: `Mnf Flow`, `Hyd Pressure1`, `Hyd Pressure2`, and `Hyd Pressure3`. We suspect that they are effectively missing data. Do we want to do any of the following?

- Leave them as they are and treat them as legitimate values.
- Convert them to `NA` and impute them.

We don't know enough about the process to be certain that these are missing data, but the patterns are suspicious enough that we will convert them to proper `NA`s and let imputation handle them at training time.
train_data_pp <- train_data_pp |>
mutate(
# convert the NAs encoded as numbers into proper NAs.
`Mnf Flow` = if_else(`Mnf Flow` < 0, NA, `Mnf Flow`),
# For those with legitimate 0 values, this is flawed.
`Hyd Pressure1` = if_else(`Hyd Pressure1` == 0, NA, `Hyd Pressure1`),
`Hyd Pressure2` = if_else(`Hyd Pressure2` == 0, NA, `Hyd Pressure2`),
`Hyd Pressure3` = if_else(`Hyd Pressure3` == 0, NA, `Hyd Pressure3`)
)
This is our little workshop of utilities. You will see below in `preprocess()` that we don't actually partition our data. Our dataset being somewhat small, we have decided to use cross-validation to test the performance of our models. We think this is better because it uses different combinations of data to train and test our models. If we partition our training dataset, then we compromise our training and testing in one of two mutually exclusive ways: if we partition and do not use cross-validation, we train our model with only one snapshot of our data, which could lead to bias; if we partition and use cross-validation, we train it with an even smaller dataset. We feel we have everything to gain by not partitioning the data and nothing to lose.
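For contrast, the holdout approach we decided against would look something like this (a sketch; the 80/20 split ratio is arbitrary):
set.seed(31415)
# a conventional 80/20 holdout split -- the approach we chose not to use
in_train <- createDataPartition(train_data_pp$PH, p = 0.8, list = FALSE)
holdout_train <- train_data_pp[in_train, ]
holdout_test <- train_data_pp[-in_train, ]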
`preprocess()` is a simple utility that sets the random seed in the hope that all models will use the same cross-validated sequence of data. It extracts `y` from `data` and applies the preprocessing `methods` to the predictors. It returns a list including the preprocessed data, the preprocess model, and the response variable.
preprocess <- function(data, methods) {
set.seed(31415)
# separate the data into predictors and targets
X <- dplyr::select(data, -PH)
y <- data$PH
# preprocess the data
model <- preProcess(X, method = methods)
X <- predict(model, X)
list(
PPM = model,
X = X,
y = y
)
}
`report()` reports accuracy with MAE, RMSE, and R2. It also reports the most important variables in descending order.
report <- function(model) {
mae = mean(model$results$MAE)
rmse = mean(model$results$RMSE)
r2 = mean(model$results$Rsquared)
glue("{model$method}: MAE={round(mae, 2)}, RMSE={round(rmse, 2)}, R2={round(r2, 2)}") |>
print()
glue("Best tuned parameters: {colnames(model$bestTune)} = {model$bestTune}") |>
print()
print(varImp(model))
}
`ctrl` is our directive to the training process to use cross-validation to train and evaluate our models. Most models will use this control method. Some models will use a less intense cross-validation method; for example, Random Forest takes a long time to run with our default parameters, so we cut its cross-validation down to 2 folds and 5 repeats.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
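Partial Least Squares (PLS) is our first linear model. PLS projects the predictors onto a small set of latent components chosen to maximize covariance with the response, which makes it well suited to datasets like ours that contain many correlated predictors.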
data <- preprocess(
train_data_pp,
methods = c("knnImpute", "YeoJohnson", "center", "scale")
)
pls_model <- train(
x = data$X,
y = data$y,
method = "pls",
tuneLength = 20,
trControl = ctrl
)
report(pls_model)
## pls: MAE=0.11, RMSE=0.14, R2=0.38
## Best tuned parameters: ncomp = 20
## pls variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## `\\`Mnf Flow\\`` 100.00
## `\\`Brand Code\\`C` 80.48
## `\\`Usage cont\\`` 64.55
## `\\`Bowl Setpoint\\`` 64.05
## `\\`Filler Level\\`` 54.85
## `\\`Hyd Pressure3\\`` 52.32
## Temperature 49.47
## `\\`Pressure Setpoint\\`` 49.16
## `\\`Hyd Pressure2\\`` 46.01
## `\\`Brand Code\\`B` 43.27
## `\\`Fill Pressure\\`` 40.36
## `\\`Hyd Pressure1\\`` 36.08
## `\\`Pressure Vacuum\\`` 35.67
## `\\`Carb Pressure1\\`` 33.63
## `\\`Carb Flow\\`` 33.53
## `\\`Brand Code\\`D` 30.95
## `\\`Brand Code\\`A` 29.53
## `\\`Oxygen Filler\\`` 29.38
## `\\`Hyd Pressure4\\`` 23.45
## `\\`Carb Rel\\`` 23.19
Ordinary Least Squares (OLS) is a benchmark linear modeling method. We've already preprocessed the data in a way that suits both PLS and OLS. To determine which of these two models is the better linear regression, let's fit an OLS model.
ols_model <- train(
x = data$X,
y = data$y,
method = "lm",
trControl = ctrl
)
report(ols_model)
## lm: MAE=0.1, RMSE=0.13, R2=0.4
## Best tuned parameters: intercept = TRUE
## lm variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## `\\`Mnf Flow\\`` 100.000
## `\\`Carb Pressure1\\`` 61.468
## Temperature 38.469
## `\\`Usage cont\\`` 37.312
## `\\`Bowl Setpoint\\`` 34.030
## `\\`Brand Code\\`C` 32.573
## `\\`Hyd Pressure3\\`` 31.683
## Density 29.763
## `\\`Pressure Setpoint\\`` 26.424
## `\\`Oxygen Filler\\`` 26.261
## `\\`Brand Code\\`A` 18.418
## `\\`Alch Rel\\`` 16.912
## `\\`Carb Flow\\`` 13.997
## `\\`Filler Level\\`` 12.102
## `\\`Fill Ounces\\`` 11.678
## `\\`PSC CO2\\`` 11.650
## `\\`Hyd Pressure2\\`` 11.131
## `\\`Carb Rel\\`` 10.193
## `\\`PC Volume\\`` 9.572
## `\\`PSC Fill\\`` 8.811
OLS is interpretable and useful as a baseline. It assumes linear relationships and independence. It may not perform as well as more complex models if predictors are collinear or relationships are nonlinear.
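We can sanity-check those assumptions on the fitted model (a sketch of our own; caret stores the underlying lm object in finalModel):
# residuals vs. fitted values for the final OLS fit; visible curvature would
# suggest non-linear relationships that OLS cannot capture
plot(ols_model$finalModel, which = 1)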
To further assess whether regularization improves performance, we explore Lasso, Ridge, and Elastic Net. These approaches can reduce overfitting and handle correlated predictors more effectively than OLS. We compare them using the same repeated cross-validation framework as the models above.
lasso_model <- train(
x = data$X,
y = data$y,
method = "lasso",
tuneLength = 20,
trControl = ctrl,
preProcess = NULL
)
report(lasso_model)
## lasso: MAE=0.11, RMSE=0.14, R2=0.37
## Best tuned parameters: fraction = 0.9
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## `Mnf Flow` 100.000
## `Bowl Setpoint` 59.594
## `Filler Level` 51.518
## `Usage cont` 49.516
## `Pressure Setpoint` 46.216
## `Carb Flow` 41.604
## `Brand Code`C 39.243
## `Pressure Vacuum` 26.382
## `Hyd Pressure3` 26.289
## `Hyd Pressure2` 22.913
## `Fill Pressure` 20.184
## `Brand Code`D 16.949
## `Oxygen Filler` 14.012
## `Carb Rel` 12.635
## Temperature 11.965
## `Hyd Pressure1` 11.618
## `Alch Rel` 10.601
## `Hyd Pressure4` 10.396
## `Brand Code`A 4.905
## MFR 4.545
ridge_model <- train(
x = data$X,
y = data$y,
method = "ridge",
tuneLength = 20,
trControl = ctrl,
preProcess = NULL
)
report(ridge_model)
## ridge: MAE=0.1, RMSE=0.13, R2=0.4
## Best tuned parameters: lambda = 0.01
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## `Mnf Flow` 100.000
## `Bowl Setpoint` 59.594
## `Filler Level` 51.518
## `Usage cont` 49.516
## `Pressure Setpoint` 46.216
## `Carb Flow` 41.604
## `Brand Code`C 39.243
## `Pressure Vacuum` 26.382
## `Hyd Pressure3` 26.289
## `Hyd Pressure2` 22.913
## `Fill Pressure` 20.184
## `Brand Code`D 16.949
## `Oxygen Filler` 14.012
## `Carb Rel` 12.635
## Temperature 11.965
## `Hyd Pressure1` 11.618
## `Alch Rel` 10.601
## `Hyd Pressure4` 10.396
## `Brand Code`A 4.905
## MFR 4.545
elastic_model <- train(
x = data$X,
y = data$y,
method = "glmnet",
tuneLength = 20,
trControl = ctrl,
preProcess = NULL
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
report(elastic_model)
## glmnet: MAE=0.11, RMSE=0.14, R2=NaN
## Best tuned parameters: alpha = 0.952631578947368
## Best tuned parameters: lambda = 0.000335359611608051
## glmnet variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## `Mnf Flow` 100.000
## `Hyd Pressure3` 62.568
## `Brand Code`C 50.353
## `Bowl Setpoint` 49.733
## `Carb Pressure1` 37.185
## Density 36.771
## `Alch Rel` 32.715
## `Usage cont` 24.122
## Temperature 21.838
## `Pressure Setpoint` 19.312
## `Hyd Pressure2` 16.036
## `Oxygen Filler` 15.373
## `Brand Code`A 14.679
## `Filler Level` 13.963
## `Carb Rel` 10.869
## `Carb Flow` 10.216
## `Carb Volume` 9.254
## `Fill Pressure` 7.366
## `Fill Ounces` 6.815
## `Hyd Pressure1` 6.218
We see a warning after training the Elastic Net model that there were missing values in the resampled performance measures. The model still finished training and produced valid results, and, as we will see below, its metrics are close to the others'. We will likely not recommend selecting it as our final model unless it offers clear advantages (which, as we will see, it does not).
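To see where the warning comes from, here is a quick diagnostic sketch (our own addition, not part of the original pipeline): R-squared is undefined for a resample whose predictions are constant, which can happen for heavily penalized tuning candidates, and those NaNs propagate into the averaged results.
# inspect which tuning combinations produced undefined R-squared
elastic_model$results |>
filter(is.nan(Rsquared)) |>
select(alpha, lambda)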
After training and comparing five linear modeling approaches — OLS, PLS, Lasso, Ridge, and Elastic Net — we find that performance across all methods is remarkably similar. Each model was evaluated using repeated 10-fold cross-validation, and results were compared using RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error).
The best RMSE and R-squared values were achieved by OLS and Ridge, but the differences across all five linear models are small (within roughly 0.005 RMSE units). The lowest MAE was from OLS at 0.104241, but again, the margin is extremely small.
Given the similarity in performance, we recommend Ordinary Least Squares (OLS) for this use case: it is the simplest and most transparent of the five, and nothing in the results justifies added complexity. Elastic Net showed nearly identical performance but triggered a convergence warning during resampling, suggesting mild instability. If regularization becomes important due to changing data conditions or overfitting concerns in future phases, Lasso or Elastic Net could be revisited.

This modeling exercise confirms that pH can be predicted with consistent accuracy using standard linear models. The choice of model does not significantly alter accuracy, allowing the business to prioritize simplicity and transparency. Implementing the OLS model gives stakeholders an interpretable baseline whose coefficients can be examined and explained directly.

We now turn to non-linear models. For these, we preprocess with centering, scaling, median imputation, and PCA.
data <- preprocess(train_data_pp, c("center", "scale", "medianImpute", "pca"))
k-Nearest Neighbors is a non-parametric method used for classification and regression. It works by finding the k nearest neighbors of a given point and averaging their response values to form the prediction.
knn_model <- train(
x = data$X,
y = data$y,
method = "knn",
tuneLength = 20,
trControl = ctrl
)
report(knn_model)
## knn: MAE=0.1, RMSE=0.13, R2=0.45
## Best tuned parameters: k = 9
## loess r-squared variable importance
##
## Overall
## PC2 100.0000
## PC1 32.9145
## PC6 30.1517
## PC7 27.3098
## PC17 16.6429
## PC4 13.9365
## PC8 13.1987
## PC20 11.9676
## PC3 9.8122
## PC15 8.7371
## PC5 8.3752
## PC18 7.8995
## PC9 7.2142
## PC16 7.1215
## PC19 6.2098
## PC12 3.1794
## PC13 2.5477
## PC10 1.5031
## PC11 0.5908
## PC14 0.0000
Support Vector Machines (SVM) is a supervised learning algorithm that can be used for classification or regression. For classification it finds the hyperplane that best separates the classes; for regression (our case) it fits a function that keeps most observations within a margin of tolerance while penalizing larger deviations. We will try three different SVM kernels: linear, polynomial, and radial.
svml_model <- train(
x = data$X,
y = data$y,
method = "svmLinear",
tuneLength = 10,
trControl = ctrl
)
report(svml_model)
## svmLinear: MAE=0.11, RMSE=0.14, R2=0.34
## Best tuned parameters: C = 1
## loess r-squared variable importance
##
## Overall
## PC2 100.0000
## PC1 32.9145
## PC6 30.1517
## PC7 27.3098
## PC17 16.6429
## PC4 13.9365
## PC8 13.1987
## PC20 11.9676
## PC3 9.8122
## PC15 8.7371
## PC5 8.3752
## PC18 7.8995
## PC9 7.2142
## PC16 7.1215
## PC19 6.2098
## PC12 3.1794
## PC13 2.5477
## PC10 1.5031
## PC11 0.5908
## PC14 0.0000
svmp_model <- train(
x = data$X,
y = data$y,
method = "svmPoly",
tuneLength = 3,
trControl = trainControl(
method = "repeatedcv",
number = 2,
repeats = 5
)
)
report(svmp_model)
## svmPoly: MAE=0.11, RMSE=0.15, R2=0.34
## Best tuned parameters: degree = 3
## Best tuned parameters: scale = 0.01
## Best tuned parameters: C = 1
## loess r-squared variable importance
##
## Overall
## PC2 100.0000
## PC1 32.9145
## PC6 30.1517
## PC7 27.3098
## PC17 16.6429
## PC4 13.9365
## PC8 13.1987
## PC20 11.9676
## PC3 9.8122
## PC15 8.7371
## PC5 8.3752
## PC18 7.8995
## PC9 7.2142
## PC16 7.1215
## PC19 6.2098
## PC12 3.1794
## PC13 2.5477
## PC10 1.5031
## PC11 0.5908
## PC14 0.0000
svmr_model <- train(
x = data$X,
y = data$y,
method = "svmRadial",
tuneLength = 5,
trControl = ctrl
)
report(svmr_model)
## svmRadial: MAE=0.09, RMSE=0.12, R2=0.49
## Best tuned parameters: sigma = 0.0322972739788172
## Best tuned parameters: C = 4
## loess r-squared variable importance
##
## Overall
## PC2 100.0000
## PC1 32.9145
## PC6 30.1517
## PC7 27.3098
## PC17 16.6429
## PC4 13.9365
## PC8 13.1987
## PC20 11.9676
## PC3 9.8122
## PC15 8.7371
## PC5 8.3752
## PC18 7.8995
## PC9 7.2142
## PC16 7.1215
## PC19 6.2098
## PC12 3.1794
## PC13 2.5477
## PC10 1.5031
## PC11 0.5908
## PC14 0.0000
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that can be used for both linear and non-linear regression. It works by fitting piece-wise linear functions to the data.
We are using a less intensive cross validation method for MARS because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid = expand.grid(
degree = 1:2,
nprune = 5:10
)
mars_model <- train(
x = data$X,
y = data$y,
method = "earth",
tuneGrid = grid,
trControl = trainControl(
method = "repeatedcv",
number = 2,
repeats = 5
)
)
report(mars_model)
## earth: MAE=0.11, RMSE=0.14, R2=0.3
## Best tuned parameters: nprune = 10
## Best tuned parameters: degree = 2
## earth variable importance
##
## Overall
## PC2 100.000
## PC1 71.570
## PC6 71.570
## PC7 52.247
## PC17 28.489
## PC18 24.062
## PC13 16.917
## PC11 9.624
## PC15 9.624
## PC16 0.000
Neural Networks are a class of models that are inspired by the way the human brain works. They are used for both classification and regression tasks. In this case, we are using it for regression.
As with MARS, we are using a less intensive cross validation method for Neural Networks because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid <- expand.grid(
size = c(2, 4, 6, 8, 10),
decay = c(0, 0.05, 0.1, 0.15)
)
nn_model <- train(
x = data$X,
y = data$y,
method = "nnet",
linout = TRUE,
trace = FALSE,
maxit = 1000,
tuneGrid = grid,
trControl = trainControl(
method = "repeatedcv",
number = 2,
repeats = 5
)
)
report(nn_model)
## nnet: MAE=0.11, RMSE=0.17, R2=0.38
## Best tuned parameters: size = 6
## Best tuned parameters: decay = 0.1
## nnet variable importance
##
## Overall
## PC20 100.00
## PC6 97.17
## PC2 93.23
## PC19 82.66
## PC7 82.64
## PC16 80.55
## PC17 68.43
## PC4 68.32
## PC1 58.24
## PC18 54.43
## PC10 44.00
## PC13 34.40
## PC15 33.81
## PC8 32.10
## PC5 30.84
## PC3 28.66
## PC11 28.27
## PC14 18.44
## PC12 10.77
## PC9 0.00
Of the non-linear models, the SVM radial model performed best with an R-squared of 0.485, followed closely by KNN at 0.455. Somewhat surprisingly, the MARS model had the lowest R-squared of the non-linear bunch at 0.302, while the Neural Network came in at 0.384.

The MAE and RMSE mostly reinforce R-squared's story. The non-linear champion, SVM radial, had the lowest MAE (0.093) and lowest RMSE (0.125), while at the other extreme MARS had the highest MAE (0.114) and one of the highest RMSEs (0.144).
rpart_data <- preprocess(train_data_pp, "medianImpute")
# column names with backticks are causing issues with the caret package.
rpart_data$X <- clean_names(rpart_data$X)
rpart_model <- train(
x = rpart_data$X,
y = rpart_data$y,
method = "rpart",
tuneLength = 10,
trControl = ctrl
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
report(rpart_model)
## rpart: MAE=0.11, RMSE=0.14, R2=0.34
## Best tuned parameters: cp = 0.0136301352354267
## rpart variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## mnf_flow 100.000
## temperature 88.758
## usage_cont 80.417
## filler_speed 76.020
## balling_lvl 69.764
## fill_pressure 51.009
## bowl_setpoint 48.788
## hyd_pressure3 48.601
## filler_level 45.544
## air_pressurer 45.273
## pressure_vacuum 41.667
## brand_code_c 40.155
## carb_flow 36.453
## pressure_setpoint 32.795
## carb_pressure1 32.761
## carb_rel 29.118
## oxygen_filler 28.459
## alch_rel 20.989
## brand_code_d 18.941
## carb_volume 7.517
Let's visualize the model:
rpart.plot(
rpart_model$finalModel,
type = 2,
extra = 101,
fallen.leaves = TRUE,
main = "Decision Tree for pH Prediction"
)
The recursive partitioning model was built using the rpart method with 10 levels of complexity parameter (cp) tuning. The optimal cp value found was 0.0136, balancing tree complexity and performance. While the decision tree offers good interpretability, its predictive power was moderate: an R-squared of 0.34 indicates only limited explanatory capability for the variance in the response variable. The visualization of the decision tree helps us understand how key variables influence the pH prediction.

The top three predictors identified were: mnf_flow (100% importance), temperature (88.76), and usage_cont (80.42).
rf_data <- rpart_data
# Random forest takes a very long time to run with our default parameters.
# We are cutting down on cross validation
rf_model <- train(
x = rf_data$X,
y = rf_data$y,
method = "rf", # caret's default method, made explicit
importance = TRUE,
trControl = trainControl(
method = "repeatedcv",
number = 2,
repeats = 5
)
)
report(rf_model)
## rf: MAE=0.08, RMSE=0.11, R2=0.63
## Best tuned parameters: mtry = 18
## rf variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## mnf_flow 100.00
## brand_code_c 92.93
## pressure_vacuum 88.75
## oxygen_filler 77.26
## balling_lvl 65.47
## air_pressurer 61.98
## temperature 59.73
## alch_rel 58.12
## carb_rel 58.11
## usage_cont 54.06
## density 52.87
## hyd_pressure3 51.18
## filler_speed 48.47
## carb_flow 45.49
## carb_pressure1 44.49
## hyd_pressure1 43.02
## bowl_setpoint 40.53
## hyd_pressure2 38.17
## fill_pressure 31.83
## filler_level 30.07
Let's visualize the model:
varImpPlot(
rf_model$finalModel,
type = 1,
main = "Random Forest Variable Importance"
)
The Random Forest outperformed the decision tree, nearly doubling the explanatory power (R-squared of 0.63). It provided much stronger predictive performance, suggesting it is better suited for this dataset despite longer training time and reduced interpretability.

Top predictors included mnf_flow (100% importance), brand_code_c (92.93), and pressure_vacuum (88.75); mnf_flow's dominance is consistent with the decision tree.
Having used cross-validation to build our models, they are now populated with `results`. Here we will take a look at some of those summary statistics and compare the models to one another.
model_list <- list(
PLS = pls_model,
OLS = ols_model,
Lasso = lasso_model,
Ridge = ridge_model,
ElasticNet = elastic_model,
KNN = knn_model,
SVM_linear = svml_model,
SVM_poly = svmp_model,
SVM_radial = svmr_model,
MARS = mars_model,
NeuralNet = nn_model,
RPart = rpart_model,
RandomForest = rf_model
)
# extract the results from each model
model_results <- tibble(
Model = names(model_list),
MAE = map(
model_list, function(x) mean(x$results$MAE, na.rm = TRUE)
) |> as_vector(),
RMSE = map(
model_list, function(x) mean(x$results$RMSE, na.rm = TRUE)
) |> as_vector(),
Rsquared = map(
model_list, function(x) mean(x$results$Rsquared, na.rm = TRUE)
) |> as_vector()
)
model_results |>
arrange(desc(Rsquared), MAE, RMSE) |>
mutate(
MAE = round(MAE, 3),
RMSE = round(RMSE, 3),
Rsquared = round(Rsquared, 3)
) |>
kable()
Model | MAE | RMSE | Rsquared |
---|---|---|---|
RandomForest | 0.080 | 0.108 | 0.626 |
SVM_radial | 0.093 | 0.125 | 0.485 |
KNN | 0.099 | 0.128 | 0.455 |
OLS | 0.104 | 0.134 | 0.396 |
Ridge | 0.104 | 0.134 | 0.396 |
NeuralNet | 0.109 | 0.171 | 0.384 |
PLS | 0.106 | 0.136 | 0.380 |
Lasso | 0.108 | 0.138 | 0.371 |
ElasticNet | 0.109 | 0.139 | 0.370 |
RPart | 0.111 | 0.141 | 0.340 |
SVM_poly | 0.113 | 0.148 | 0.338 |
SVM_linear | 0.110 | 0.141 | 0.337 |
MARS | 0.114 | 0.144 | 0.302 |
A champion, ideally, is a model with the lowest MAE, the lowest RMSE, and the highest R-squared. And we have such a champion. Please allow us to present the winner of our modeling competition: Random Forest. It is robust and moderately interpretable, though not very fast to compute.

Among all tested models, the Random Forest model demonstrated the best overall performance, achieving the lowest MAE (0.080), lowest RMSE (0.108), and highest R-squared (0.626), indicating superior predictive accuracy and explanatory power.

The SVM radial and KNN models followed, with moderate performance (R-squared of 0.485 and 0.455, respectively), but did not match the precision of Random Forest. Linear models such as OLS, Ridge, and Elastic Net showed similar results, with R-squared values clustered around 0.37 to 0.40.

The Neural Network, PLS, Lasso, and RPart (Decision Tree) models exhibited slightly weaker performance, with R-squared values below 0.40. The MARS model had the lowest performance overall (R-squared of 0.302).

In conclusion, Random Forest is the recommended model for this dataset due to its balance of accuracy and robustness, despite being more complex and computationally intensive than the alternatives.
Finally, we will make predictions using our champion model, Random Forest. We will use the test data to make predictions and save them to a CSV file.
# preprocess the test data with the same preprocessing model we fit on the training data
df <- predict(rf_data$PPM, test_data_pp)
# the variable names in our training set caused problems with `randomForest` so we
# "cleaned" them. The model is expecting the same variables.
df <- clean_names(df)
data.frame(
SampleID = 1:nrow(df),
Predicted_pH = predict(rf_model, newdata = df)
) |>
write.csv(
"Final_Predictions.csv",
row.names = FALSE
)
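Should the champion model need to be reused without retraining, it can be persisted alongside the preprocessing model it depends on (a sketch; the file name is our own choice):
# persist the fitted model and its preprocessing recipe for later reuse
saveRDS(list(model = rf_model, preprocess = rf_data$PPM), "rf_champion.rds")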