library(corrplot)
library(caret)
library(earth)
library(e1071)
library(Formula)
library(GGally)
library(ggpubr)
library(glmnet)
library(glue)
library(janitor)
library(knitr)
library(Metrics)
library(plotmo)
library(plotrix)
library(pls)
library(RANN)
library(randomForest)
library(readr)
library(readxl)
library(rpart)
library(rpart.plot)
library(tidyverse)
Project 2: Team Assignment
Assignment
ABC Beverage is conducting an internal review of its manufacturing process with a focus on understanding the factors that influence product pH levels. This technical report highlights our team’s data science-driven approach to modeling and predicting pH using historical production data.
To achieve this, we analyzed the dataset provided by ABC Beverage, performed data cleaning and exploratory analysis, and evaluated several classes of models, including linear and nonlinear regressions, decision trees, and random forests. Throughout this report we outline our methodology, modeling process, evaluation strategy, and the final model selected for predicting pH levels.
Setup
Load the Data
First we must load the data. We were provided two datasets and initially presumed that one was designated for training (“StudentData.xlsx”) and the other for testing (“StudentEvaluation.xlsx”). However, we found that “StudentEvaluation.xlsx” does not contain target variable values, so we concluded that “StudentData.xlsx” is for both training and testing, and that “StudentEvaluation.xlsx” is for our final predictions.
<- read_excel("StudentData.xlsx")
train_data <- read_excel("StudentEvaluation.xlsx") test_data
EDA
George Hagstrom, our professor of DATA 607, defined exploratory data analysis as “…the art of looking at data in a systematic way in order to understand the underlying structure of the data. It has two main goals: ensure data quality and uncover patterns to guide future analysis. EDA is detective work: you ask and answer questions, which inspires more questions.” And that is exactly what we will do. We are going to get to know our datasets by summarizing them, scrutinizing them, looking at their correlation properties, and visualizing them.
Summary Information
We will take a look at a sample of the data: its columns and types, the number of rows and columns, and the first few rows. Finally, we will calculate summary statistics (the mean, median, min, max, and standard deviation of each column) so that we may better understand the nature of our data.
head(train_data)
# A tibble: 6 × 33
`Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
<chr> <dbl> <dbl> <dbl> <dbl>
1 B 5.34 24.0 0.263 68.2
2 A 5.43 24.0 0.239 68.4
3 B 5.29 24.1 0.263 70.8
4 A 5.44 24.0 0.293 63
5 A 5.49 24.3 0.111 67.2
6 A 5.38 23.9 0.269 66.6
# ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
# `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
# `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
# `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
# `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
# `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
# `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(train_data)
tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
$ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
$ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
$ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
$ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
$ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
$ Carb Temp : num [1:2571] 141 140 145 133 137 ...
$ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
$ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
$ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
$ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
$ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
$ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
$ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
$ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
$ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
$ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
$ Filler Level : num [1:2571] 121 119 120 118 119 ...
$ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
$ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
$ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
$ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
$ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
$ MFR : num [1:2571] 725 727 735 731 723 ...
$ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
$ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
$ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
$ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
$ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
$ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
$ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
$ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
$ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
$ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
We can see there are 33 columns and 2,571 rows in total. The 33 columns include the “PH” column, which is what we will be aiming to predict: PH will be the response variable throughout our modeling exploration.
We can see that we have mostly numeric values. The only character (categorical) variable is Brand Code. Our team’s preferred guidance on predictive modeling, Applied Predictive Modeling by Max Kuhn and Kjell Johnson, recommends either converting categorical predictors to dummy variables or removing them if they are uninformative or have too many categories. We have found that Brand Code has significant importance in some of our models, so we will convert it to dummy variables before training (models such as Lasso require numeric predictors).
We can also see that there are negative values, which means we won’t be able to apply the BoxCox method when preprocessing our data. Our guidance and standards, from Applied Predictive Modeling, state that if the data includes non-positive values we should use the YeoJohnson method instead.
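As a quick illustration (a sketch, not part of our pipeline), we can confirm which columns contain non-positive values, which rule out Box-Cox, and verify that preProcess() accepts Yeo-Johnson in their place:

# Sketch: flag numeric columns containing zero or negative values (these rule out Box-Cox).
non_positive <- sapply(select(train_data, where(is.numeric)), function(x) {
  any(x <= 0, na.rm = TRUE)
})
names(non_positive)[non_positive]

# Yeo-Johnson handles zero and negative values where Box-Cox cannot.
pp_yj <- preProcess(select(train_data, where(is.numeric)), method = "YeoJohnson")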
We will also take a look at the summary statistics of our data. We would normally use the summary() function, but it prints wide rather than long and does not present well. So, we are going to take a stick shift approach.
df <- select(train_data, -`Brand Code`)
data.frame(
  mean = sapply(df, mean, na.rm = TRUE) |> round(3),
  median = sapply(df, median, na.rm = TRUE) |> round(3),
  min = sapply(df, min, na.rm = TRUE) |> round(3),
  max = sapply(df, max, na.rm = TRUE) |> round(3),
  sd = sapply(df, sd, na.rm = TRUE) |> round(3),
  sd.normal = sapply(df, function(x) {
    x <- x[!is.na(x)]
    sd(x) / (max(x) - min(x))
  }) |> round(3)
) |>
  kable()
variable | mean | median | min | max | sd | sd.normal |
---|---|---|---|---|---|---|
Carb Volume | 5.370 | 5.347 | 5.040 | 5.700 | 0.106 | 0.161 |
Fill Ounces | 23.975 | 23.973 | 23.633 | 24.320 | 0.088 | 0.127 |
PC Volume | 0.277 | 0.271 | 0.079 | 0.478 | 0.061 | 0.152 |
Carb Pressure | 68.190 | 68.200 | 57.000 | 79.400 | 3.538 | 0.158 |
Carb Temp | 141.095 | 140.800 | 128.600 | 154.000 | 4.037 | 0.159 |
PSC | 0.085 | 0.076 | 0.002 | 0.270 | 0.049 | 0.184 |
PSC Fill | 0.195 | 0.180 | 0.000 | 0.620 | 0.118 | 0.190 |
PSC CO2 | 0.056 | 0.040 | 0.000 | 0.240 | 0.043 | 0.179 |
Mnf Flow | 24.569 | 65.200 | -100.200 | 229.400 | 119.481 | 0.363 |
Carb Pressure1 | 122.586 | 123.200 | 105.600 | 140.200 | 4.743 | 0.137 |
Fill Pressure | 47.922 | 46.400 | 34.600 | 60.400 | 3.178 | 0.123 |
Hyd Pressure1 | 12.438 | 11.400 | -0.800 | 58.000 | 12.433 | 0.211 |
Hyd Pressure2 | 20.961 | 28.600 | 0.000 | 59.400 | 16.386 | 0.276 |
Hyd Pressure3 | 20.458 | 27.600 | -1.200 | 50.000 | 15.976 | 0.312 |
Hyd Pressure4 | 96.289 | 96.000 | 52.000 | 142.000 | 13.123 | 0.146 |
Filler Level | 109.252 | 118.400 | 55.800 | 161.200 | 15.698 | 0.149 |
Filler Speed | 3687.199 | 3982.000 | 998.000 | 4030.000 | 770.820 | 0.254 |
Temperature | 65.968 | 65.600 | 63.600 | 76.200 | 1.383 | 0.110 |
Usage cont | 20.993 | 21.790 | 12.080 | 25.900 | 2.978 | 0.215 |
Carb Flow | 2468.354 | 3028.000 | 26.000 | 5104.000 | 1073.696 | 0.211 |
Density | 1.174 | 0.980 | 0.240 | 1.920 | 0.378 | 0.225 |
MFR | 704.049 | 724.000 | 31.400 | 868.600 | 73.898 | 0.088 |
Balling | 2.198 | 1.648 | -0.170 | 4.012 | 0.931 | 0.223 |
Pressure Vacuum | -5.216 | -5.400 | -6.600 | -3.600 | 0.570 | 0.190 |
PH | 8.546 | 8.540 | 7.880 | 9.360 | 0.173 | 0.117 |
Oxygen Filler | 0.047 | 0.033 | 0.002 | 0.400 | 0.047 | 0.117 |
Bowl Setpoint | 109.327 | 120.000 | 70.000 | 140.000 | 15.303 | 0.219 |
Pressure Setpoint | 47.615 | 46.000 | 44.000 | 52.000 | 2.039 | 0.255 |
Air Pressurer | 142.834 | 142.600 | 140.800 | 148.200 | 1.212 | 0.164 |
Alch Rel | 6.897 | 6.560 | 5.280 | 8.620 | 0.505 | 0.151 |
Carb Rel | 5.437 | 5.400 | 4.960 | 6.060 | 0.129 | 0.117 |
Balling Lvl | 2.050 | 1.480 | 0.000 | 3.660 | 0.870 | 0.238 |
Looking at the normalized standard deviation (sd.normal), we see that MFR is very low. On the other hand, Mnf Flow and Hyd Pressure3 have relatively high amounts of variance.
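As a complement to eyeballing sd.normal, caret’s nearZeroVar() can flag predictors with degenerate variance. A quick sketch (not part of our pipeline; it may return no names at all):

# Sketch: list any numeric predictors with near-zero variance.
nearZeroVar(select(train_data, where(is.numeric)), names = TRUE)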
Missing Values (NA)
Let’s quantify how much of our data is missing.
count_missing <- function(data) {
  data |>
    summarise(across(everything(), ~ sum(is.na(.)))) |>
    pivot_longer(
      everything(),
      names_to = "variable",
      values_to = "missing"
    ) |>
    filter(missing > 0) |>
    mutate(
      ratio = missing / nrow(data)
    ) |>
    arrange(desc(missing))
}
count_missing(train_data) |>
kable()
variable | missing | ratio |
---|---|---|
MFR | 212 | 0.0824582 |
Brand Code | 120 | 0.0466744 |
Filler Speed | 57 | 0.0221704 |
PC Volume | 39 | 0.0151692 |
PSC CO2 | 39 | 0.0151692 |
Fill Ounces | 38 | 0.0147802 |
PSC | 33 | 0.0128355 |
Carb Pressure1 | 32 | 0.0124465 |
Hyd Pressure4 | 30 | 0.0116686 |
Carb Pressure | 27 | 0.0105018 |
Carb Temp | 26 | 0.0101128 |
PSC Fill | 23 | 0.0089459 |
Fill Pressure | 22 | 0.0085570 |
Filler Level | 20 | 0.0077791 |
Hyd Pressure2 | 15 | 0.0058343 |
Hyd Pressure3 | 15 | 0.0058343 |
Temperature | 14 | 0.0054454 |
Oxygen Filler | 12 | 0.0046674 |
Pressure Setpoint | 12 | 0.0046674 |
Hyd Pressure1 | 11 | 0.0042785 |
Carb Volume | 10 | 0.0038895 |
Carb Rel | 10 | 0.0038895 |
Alch Rel | 9 | 0.0035006 |
Usage cont | 5 | 0.0019448 |
PH | 4 | 0.0015558 |
Mnf Flow | 2 | 0.0007779 |
Carb Flow | 2 | 0.0007779 |
Bowl Setpoint | 2 | 0.0007779 |
Density | 1 | 0.0003890 |
Balling | 1 | 0.0003890 |
Balling Lvl | 1 | 0.0003890 |
count_missing(test_data) |>
kable()
variable | missing | ratio |
---|---|---|
PH | 267 | 1.0000000 |
MFR | 31 | 0.1161049 |
Filler Speed | 10 | 0.0374532 |
Brand Code | 8 | 0.0299625 |
Fill Ounces | 6 | 0.0224719 |
PSC | 5 | 0.0187266 |
PSC CO2 | 5 | 0.0187266 |
PC Volume | 4 | 0.0149813 |
Carb Pressure1 | 4 | 0.0149813 |
Hyd Pressure4 | 4 | 0.0149813 |
PSC Fill | 3 | 0.0112360 |
Oxygen Filler | 3 | 0.0112360 |
Alch Rel | 3 | 0.0112360 |
Fill Pressure | 2 | 0.0074906 |
Filler Level | 2 | 0.0074906 |
Temperature | 2 | 0.0074906 |
Usage cont | 2 | 0.0074906 |
Pressure Setpoint | 2 | 0.0074906 |
Carb Rel | 2 | 0.0074906 |
Carb Volume | 1 | 0.0037453 |
Carb Temp | 1 | 0.0037453 |
Hyd Pressure2 | 1 | 0.0037453 |
Hyd Pressure3 | 1 | 0.0037453 |
Density | 1 | 0.0037453 |
Balling | 1 | 0.0037453 |
Pressure Vacuum | 1 | 0.0037453 |
Bowl Setpoint | 1 | 0.0037453 |
Air Pressurer | 1 | 0.0037453 |
As we can see, there are a fair number of missing values. PH’s four missing values may not be topping the charts; nonetheless, they may be the most problematic. The guidance for this scenario is to remove the rows where PH is null from the training data, which means dropping those rows from both the predictors and the response. We will explain more when we reach that step, prior to training our models.
Distribution
We are temporarily recoding Brand Code to be numeric so that we can pivot our dataset to long format. And since we already have a handle on missing values, we are filtering them out.
train_data |>
  mutate(
    `Brand Code` = recode(
      `Brand Code`,
      "A" = 1,
      "B" = 2,
      "C" = 3,
      "D" = 4
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "values"
  ) |>
  filter(!is.na(values)) |>
  ggplot(aes(x = values)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, ncol = 4, scales = "free") +
  labs(
    title = "Distribution of Data",
    x = "Values",
    y = "Count"
  )
We see a good amount of data with normal distributions. But we also see some variables that stand out (we quantify the skews in the sketch after this list):

- Balling, Balling Lvl, and Density are bimodal. To a lesser extent, so is Air Pressurer.
- Filler Speed is heavily skewed, with most values near the high end and a crop of suspicious values at the low end.
- Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 all have a very large deposit of 0 values. Similarly, Mnf Flow has a large deposit of negative values.
- Oxygen Filler is extremely right skewed.
- Bowl Setpoint and Pressure Setpoint look as if they are comprised of discrete values.
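To back these visual impressions with numbers, here is a skewness sketch using e1071 (already loaded; positive values indicate right skew, negative values left skew):

# Sketch: numeric skewness for each predictor, sorted.
sapply(select(train_data, where(is.numeric)), e1071::skewness, na.rm = TRUE) |>
  sort() |>
  round(2)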
One Hot Encoding
Before we continue exploring, we are going to convert our categorical variable (Brand Code) to multiple variables using one hot encoding. We are doing it now because we are about to examine correlation and we want to include Brand Code in that study. We will store the results in train_data_pp, which will ultimately be our preprocessed training data. We will apply the same transformation to test_data so that we can use it for prediction later.
model <- dummyVars(~ ., data = train_data)
train_data_pp <- predict(model, newdata = train_data) |>
  as.data.frame()
test_data_pp <- predict(model, newdata = test_data) |>
  as.data.frame()
Correlation
We are using a correlation matrix to help us better understand the relationships between predictor variables. We are leaving the target (PH) in the dataset so that we may learn which predictors it correlates with, and so that we don’t remove predictors with which it has strong relationships.
cor_matrix <- cor(train_data_pp, use = "pairwise.complete.obs")
corrplot::corrplot(cor_matrix, order = "hclust", tl.cex = 0.7)
And let’s identify those that we may want to remove because they are highly correlated. We will list variables with \(\ge .85\) correlation.
cor_matrix |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  pivot_longer(
    -variable,
    names_to = "correlation",
    values_to = "value"
  ) |>
  mutate(
    value = round(value, 3),
    abs_value = abs(value)
  ) |>
  filter(
    variable != correlation & abs_value >= 0.85
  ) |>
  arrange(desc(abs_value)) |>
  kable()
variable | correlation | value | abs_value |
---|---|---|---|
Balling | Balling Lvl | 0.978 | 0.978 |
Balling Lvl | Balling | 0.978 | 0.978 |
Density | Balling | 0.955 | 0.955 |
Balling | Density | 0.955 | 0.955 |
Filler Level | Bowl Setpoint | 0.949 | 0.949 |
Bowl Setpoint | Filler Level | 0.949 | 0.949 |
Density | Balling Lvl | 0.948 | 0.948 |
Balling Lvl | Density | 0.948 | 0.948 |
Filler Speed | MFR | 0.930 | 0.930 |
MFR | Filler Speed | 0.930 | 0.930 |
Alch Rel | Balling Lvl | 0.926 | 0.926 |
Balling Lvl | Alch Rel | 0.926 | 0.926 |
Hyd Pressure2 | Hyd Pressure3 | 0.925 | 0.925 |
Hyd Pressure3 | Hyd Pressure2 | 0.925 | 0.925 |
Balling | Alch Rel | 0.925 | 0.925 |
Alch Rel | Balling | 0.925 | 0.925 |
Density | Alch Rel | 0.903 | 0.903 |
Alch Rel | Density | 0.903 | 0.903 |
Brand Code D | Alch Rel | 0.899 | 0.899 |
Alch Rel | Brand Code D | 0.899 | 0.899 |
Linear View
We saw some weird distributions in our data, and in our summary statistics we saw that some of our variables had a lot of variance and others had a lot of missing values. Now we are going to look at the data in a different way: as a line plot. Our data is not a time series, but we think it is safe to assume that the observations were made over time. So we will plot it with a line plot and see what patterns, if any, surface. We would like to emphasize that in no way are we suggesting this should be interpreted as a time series; we are simply viewing it through the lens of a line plot.
We will use pivot_longer() to reshape the data so that we can plot one variable stacked on top of another. We will also use row_number() to create an index for the x-axis, and facet_wrap() to create a separate plot for each variable.
train_data_pp |>
  mutate(
    index = row_number()
  ) |>
  relocate(
    index, .before = everything()
  ) |>
  pivot_longer(
    cols = -1,
    names_to = "variable",
    values_to = "values"
  ) |>
  ggplot(aes(x = index, y = values)) +
  geom_line() +
  facet_wrap(~ variable, ncol = 1, scales = "free") +
  labs(
    title = "Line Plot of Variables",
    x = "Index",
    y = "Values"
  )
We see several interesting patterns:

- Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3 have a lot of consecutive observations with the same value. It is very likely that these are the values we were seeing in our histograms.
- Carb Flow’s pattern is consistent from 0 to ~2000, at which point it drops significantly.
- Carb Pressure1 exhibits what looks like two different modes of operation: 0 to ~1350 and then ~1350 to the end. The “modes” are separated by what appears to be a small gap.
- Oxygen Filler also appears to have two modes of operation.
- Usage cont may be the oddest variable of all. It exhibits a noisy pattern up until ~1600, at which point it jumps radically higher and stays there in a noisy pattern.
- PH’s trend line is flat, but it clearly has a daily cycle. Just kidding (though it does almost look like time series data).
PCA Study
We employed Principal Component Analysis (PCA) as a dimensionality reduction technique to better understand the structure of the predictor space and to address potential multicollinearity among variables. This will also give us our first glimpse at important predictors and related predictors, and we may recognize these patterns in our models, as well.
Given that the final modeling objective is to predict pH, PCA was applied strictly to the predictor variables (i.e., the input features), with the target variable excluded during the transformation phase.
Methodology
The PCA was conducted using the caret and tidyverse packages in R. The procedure included the following key steps:
- Standardization: Each feature was Yeo-Johnson transformed, centered, and scaled to unit variance so that PCA would not be biased toward features with larger scales.
# Exclude target
predictors <- train_data_pp |>
  select(-PH) |>
  select(where(is.numeric))
# Preprocess: transform and standardize
predictors_prep <- preProcess(predictors,
                              method = c("YeoJohnson", "center",
                                         "scale", "medianImpute"))

# Apply the preprocessing to the predictors (PCA itself comes next)
processed_data <- predict(predictors_prep, predictors)
- PCA Transformation: Principal components were extracted from the standardized predictor matrix.
# Perform PCA (the data is already centered and scaled above)
pca_result <- prcomp(processed_data, center = FALSE, scale. = FALSE)
- Component Retention: The number of components to retain was informed by a combination of cumulative variance explained and visual inspection via a scree plot.
# Extract the variance explained from the PCA model
var_explained <- summary(pca_result)$importance[2, ]

# Create the scree plot data frame. The PCs must be factors so they populate
# the plot in order.
scree_df <- data.frame(
  PC = factor(paste0("PC", 1:length(var_explained)),
              levels = paste0("PC", 1:length(var_explained))),
  Variance = var_explained
)

# Filter to simplify visualization
scree_df_small <- scree_df[1:15, ]
# Plot
ggplot(scree_df_small, aes(x = PC, y = Variance)) +
geom_smooth(aes(group = 1), color = "darkblue", linewidth = 1, se = FALSE) +
ggtitle("Scree Plot") +
xlab("Principal Components") +
ylab("Proportion of Variance Explained") +
theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Results and Interpretation
We can use the graph above as a heuristic for determining the number of Principal Components to keep. The proportion of variance explained starts to taper off at PC6, so we would consider the first five PCs the most important for modeling. Components beyond this contribute marginally to the variance. Furthermore, we can examine the cumulative variance for each PC we add.
cumulative_variance <- summary(pca_result)$importance[3, ]

scree_df$Cumulative_Variance <- cumulative_variance
# Print table
scree_df |>
  as_tibble() |>
  filter(row_number() < 16) |>
  kable()
PC | Variance | Cumulative_Variance |
---|---|---|
PC1 | 0.20922 | 0.20922 |
PC2 | 0.16138 | 0.37060 |
PC3 | 0.07006 | 0.44066 |
PC4 | 0.05272 | 0.49337 |
PC5 | 0.04738 | 0.54075 |
PC6 | 0.04389 | 0.58464 |
PC7 | 0.04048 | 0.62512 |
PC8 | 0.03662 | 0.66174 |
PC9 | 0.03474 | 0.69648 |
PC10 | 0.03118 | 0.72766 |
PC11 | 0.02893 | 0.75659 |
PC12 | 0.02553 | 0.78212 |
PC13 | 0.02519 | 0.80731 |
PC14 | 0.02249 | 0.82981 |
PC15 | 0.02216 | 0.85196 |
From this output, we observe that the first few components capture a substantial portion of the total variance; it takes 15 components to reach roughly 85% of the cumulative variance.
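As a small illustration (a sketch using the objects defined above), the component count could be picked programmatically rather than read off the table:

# Sketch: the first PC index at which cumulative variance reaches 85%.
which(cumulative_variance >= 0.85)[1]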
Loadings and Interpretability
The rotation matrix provides the loadings of each original variable on the principal components. Loadings close to ±1 indicate strong influence, while values near 0 indicate minimal contribution.
pca_result$rotation[, 1:5] |>
  as.data.frame() |>
  arrange(desc(abs(PC1)))
PC1 PC2 PC3 PC4
`Alch Rel` 0.351917110 0.018637006 0.015947644 -0.021509088
Balling 0.350242358 -0.038655553 0.010768884 -0.051510764
`Balling Lvl` 0.347587056 0.018036175 0.058019202 -0.060589500
Density 0.342556769 0.003647185 -0.002780885 -0.008220555
`Carb Rel` 0.327287394 0.040790856 0.033190392 -0.055873032
`Carb Volume` 0.312671225 -0.010849527 0.077849439 0.014418401
`Brand Code`D 0.309509885 0.024613795 -0.030268525 0.036554775
`Brand Code`B -0.281886397 -0.018590616 -0.081590948 0.205154812
`Hyd Pressure4` -0.216438550 -0.010538108 0.259170039 -0.190256529
`Carb Pressure` 0.185499135 0.002969656 0.036219512 0.321843440
`Brand Code`A 0.102964974 -0.011611054 0.125500991 -0.121825267
`Pressure Setpoint` -0.097974309 -0.253715282 0.097041709 0.017762706
Temperature -0.085593182 0.073031414 0.238809759 -0.191529539
`Brand Code`C -0.080867400 0.007252593 0.040045941 -0.239356201
`PC Volume` -0.070027461 0.044363458 -0.300849404 -0.245631862
`Fill Pressure` -0.055001281 -0.243216724 -0.006425722 0.067604714
`Air Pressurer` -0.041131334 0.025209215 0.098674449 0.289146623
`Hyd Pressure2` 0.034053272 -0.356389157 -0.234508890 -0.128558271
`Hyd Pressure3` 0.033357301 -0.371170916 -0.209235867 -0.097374423
`Oxygen Filler` -0.033055156 0.197534661 -0.049257586 -0.014325367
`Pressure Vacuum` -0.032736654 0.268309580 -0.008011089 0.199000523
`Fill Ounces` -0.031395082 0.003712749 0.119973827 0.148610711
PSC -0.027814800 -0.009774838 -0.058858968 -0.040354680
`Mnf Flow` 0.024485822 -0.370246451 0.074307585 -0.015327355
`PSC CO2` -0.024235975 -0.017887178 0.046661968 0.026701094
`Carb Temp` 0.023503629 0.009142586 -0.001680039 0.344901502
`Filler Speed` 0.023311543 -0.069554166 -0.401263372 0.258147929
`Usage cont` 0.020636458 -0.235304698 0.168029336 0.032977655
`Bowl Setpoint` 0.019544205 0.293735628 -0.180089270 -0.265432969
MFR 0.019144953 0.014987724 -0.105029956 0.157232046
`Hyd Pressure1` 0.015014705 -0.293051774 -0.294238612 -0.220849783
`Carb Flow` -0.012178379 0.078728205 -0.353299056 0.232580514
`Filler Level` 0.009614800 0.297688391 -0.140515050 -0.270964806
`PSC Fill` -0.008729149 0.020514009 0.031714068 -0.013912533
`Carb Pressure1` 0.006908870 -0.148667635 0.370498100 0.012035387
PC5
`Alch Rel` -0.0333321566
Balling 0.0012498256
`Balling Lvl` 0.0180168286
Density 0.0169572081
`Carb Rel` 0.0042927144
`Carb Volume` -0.1089099295
`Brand Code`D -0.0990504201
`Brand Code`B -0.1349954273
`Hyd Pressure4` 0.1670885311
`Carb Pressure` 0.4985690417
`Brand Code`A 0.1690738712
`Pressure Setpoint` 0.0888629414
Temperature 0.2261451642
`Brand Code`C 0.1686322556
`PC Volume` 0.2243499330
`Fill Pressure` 0.0701692254
`Air Pressurer` -0.0159749120
`Hyd Pressure2` 0.0620873076
`Hyd Pressure3` 0.0648982399
`Oxygen Filler` 0.0873573908
`Pressure Vacuum` -0.0831161164
`Fill Ounces` -0.0698995498
PSC -0.0053777417
`Mnf Flow` -0.0175941336
`PSC CO2` 0.0701332075
`Carb Temp` 0.6153834830
`Filler Speed` -0.1399971150
`Usage cont` -0.1300730429
`Bowl Setpoint` 0.0579414364
MFR -0.1818287321
`Hyd Pressure1` 0.1129535858
`Carb Flow` 0.0651455076
`Filler Level` 0.0526725782
`PSC Fill` 0.0004411326
`Carb Pressure1` -0.0783446680
By examining the loading structure, we can interpret PC1 as a linear combination emphasizing variables such as Alch Rel, Balling, and Density.
Components with clear thematic groupings (e.g., all chemistry variables, or all environmental sensors) enhance interpretability and suggest latent structures in the data.
Preprocessing
We have discovered much and have some work to do. Some preprocessing will be applied to our training and testing datasets and used for all regression models. Other preprocessing, such as centering, scaling, and PCA, will be applied more selectively per model. To recap what we found:
- We have a fair amount of correlation.
- We have NAs in our target variable.
- We have predictors with varying amounts of missing data.
- We have a categorical variable.
- And we have some data that is not NA, but looks as if it should be.
First, we have already dealt with our categorical data in train_data_pp and test_data_pp. We shall use them as a starting place for further preprocessing.
Next, we will deal with our missing PH data. We don’t want to train our models to predict NA, so we are going to drop the four rows missing PH values.
train_data_pp <- train_data_pp |>
  filter(!is.na(PH))
# test_data_pp is already void of PH values
And now we will remove highly correlated variables. We will use a correlation threshold of 0.95. While 0.75 is a common starting point, retaining more features helps maintain signal for models like PLS, which can internally handle correlated predictors by extracting latent factors.
high_corr <- train_data_pp |>
  select(-PH) |>
  cor(use = "pairwise.complete.obs") |>
  findCorrelation(cutoff = 0.95)

# print the names of the highly correlated variables
glue("Removing columns: {colnames(train_data_pp)[high_corr]}")
Removing columns: Balling
# remove the highly correlated variables
train_data_pp <- train_data_pp[, -high_corr]
test_data_pp <- test_data_pp[, -high_corr]
For missing data we will impute, but the suggested imputation methods vary by model, so we will leave that processing to be done at training time.
Our line plots revealed some data that is suspiciously constant over many consecutive observations: Mnf Flow, Hyd Pressure1, Hyd Pressure2, and Hyd Pressure3. We suspect that these are effectively missing data. Our options are:

- Leave them as they are.
- Remove the variables altogether.
- Set the suspicious observations to NA and impute them.
After weighing the options, we chose the third: we recode the suspicious values as NA below and let imputation handle them at training time. We acknowledge the risk; we don’t know enough about the process to be certain that none of these are legitimate values, and the comment in the code notes where this assumption is flawed.
train_data_pp <- train_data_pp |>
  mutate(
    # convert the NAs encoded as numbers into proper NAs.
    `Mnf Flow` = if_else(`Mnf Flow` < 0, NA, `Mnf Flow`),
    # For rows with legitimate 0 values, this is flawed.
    `Hyd Pressure1` = if_else(`Hyd Pressure1` == 0, NA, `Hyd Pressure1`),
    `Hyd Pressure2` = if_else(`Hyd Pressure2` == 0, NA, `Hyd Pressure2`),
    `Hyd Pressure3` = if_else(`Hyd Pressure3` == 0, NA, `Hyd Pressure3`)
  )
Regression Workshop
This is our little workshop of utilities. You will see below in preprocess
that we don’t actually partition our data. Our dataset being somewhat small, we have decided to use cross validation to test the performance of our models. We think it is better because it uses different combinations of data to train and test our models. If we partition our training dataset then we are compromising our training and testing in one of two mutually exclusive ways. If we chose to partition it and not use cross validation, then we are training our model with only one snapshot of our data, which could lead to bias. If we chose to partition it and use cross validation, then we are training it with an even smaller dataset. We feel we have everything to gain by not partitioning the data and nothing to lose.
preprocess() is a simple utility that sets the random seed in the hopes that all models will use the same cross-validated sequence of data. It extracts y from data and applies the preprocessing methods to the predictors. It returns a list including the preprocessed data, the preProcess model, and the response variable.
preprocess <- function(data, methods) {
  set.seed(31415)
  # separate the data into predictors and target
  X <- dplyr::select(data, -PH)
  y <- data$PH

  # preprocess the predictors
  model <- preProcess(X, method = methods)
  X <- predict(model, X)
  list(
    PPM = model,
    X = X,
    y = y
  )
}
report() reports accuracy with MAE, RMSE, and R2. It also lists the most important variables in descending order.
report <- function(model) {
  mae <- mean(model$results$MAE)
  rmse <- mean(model$results$RMSE)
  r2 <- mean(model$results$Rsquared)
  glue("{model$method}: MAE={round(mae, 2)}, RMSE={round(rmse, 2)}, R2={round(r2, 2)}") |>
    print()
  glue("Best tuned parameters: {colnames(model$bestTune)} = {model$bestTune}") |>
    print()
  print(varImp(model))
}
ctrl is our directive to the training process to use cross validation to train and evaluate our models. Most models will use this control. Some models will use a less intense cross validation scheme; for example, Random Forest takes a long time to run with our default parameters, so we cut its cross validation down to 2 folds and 5 repeats.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
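As a side note, the reduced scheme mentioned above could be defined once and reused instead of repeating inline trainControl() calls. A sketch (ctrl_light is our own name and is not used elsewhere in this report):

# Sketch: a lighter control for the slow models (Random Forest, MARS, neural nets).
ctrl_light <- trainControl(method = "repeatedcv", number = 2, repeats = 5)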
Linear Regression
Refined Preprocessing
data <- preprocess(
  train_data_pp,
  methods = c("knnImpute", "YeoJohnson", "center", "scale")
)
Train Partial Least Squares (PLS) Model
pls_model <- train(
  x = data$X,
  y = data$y,
  method = "pls",
  tuneLength = 20,
  trControl = ctrl
)
report(pls_model)
pls: MAE=0.11, RMSE=0.14, R2=0.38
Best tuned parameters: ncomp = 20
pls variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow`          100.00
`Brand Code`C        80.48
`Usage cont`         64.55
`Bowl Setpoint`      64.05
`Filler Level`       54.85
`Hyd Pressure3`      52.32
Temperature          49.47
`Pressure Setpoint`  49.16
`Hyd Pressure2`      46.01
`Brand Code`B        43.27
`Fill Pressure`      40.36
`Hyd Pressure1`      36.08
`Pressure Vacuum`    35.67
`Carb Pressure1`     33.63
`Carb Flow`          33.53
`Brand Code`D        30.95
`Brand Code`A        29.53
`Oxygen Filler`      29.38
`Hyd Pressure4`      23.45
`Carb Rel`           23.19
Train Ordinary Least Squares (OLS) Model
Ordinary Least Squares (OLS) is a benchmark linear modeling method. We have already preprocessed the data in a way that suits both PLS and OLS. To determine which of these two linear models is best, let’s fit an OLS model.
ols_model <- train(
  x = data$X,
  y = data$y,
  method = "lm",
  trControl = ctrl
)
report(ols_model)
lm: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: intercept = TRUE
lm variable importance
only 20 most important variables shown (out of 33)
Overall
`Mnf Flow`          100.000
`Carb Pressure1`     61.468
Temperature          38.469
`Usage cont`         37.312
`Bowl Setpoint`      34.030
`Brand Code`C        32.573
`Hyd Pressure3`      31.683
Density              29.763
`Pressure Setpoint`  26.424
`Oxygen Filler`      26.261
`Brand Code`A        18.418
`Alch Rel`           16.912
`Carb Flow`          13.997
`Filler Level`       12.102
`Fill Ounces`        11.678
`PSC CO2`            11.650
`Hyd Pressure2`      11.131
`Carb Rel`           10.193
`PC Volume`           9.572
`PSC Fill`            8.811
OLS is interpretable and useful as a baseline. It assumes linear relationships and independence. It may not perform as well as more complex models if predictors are collinear or relationships are nonlinear.
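Since part of OLS’s appeal is explainability, here is a quick sketch (not part of our pipeline) of how the fitted coefficients could be surfaced for stakeholders:

# Sketch: the largest-magnitude coefficients from the fitted OLS model.
# Predictors were centered and scaled, so magnitudes are roughly comparable.
coefs <- coef(ols_model$finalModel)
head(coefs[order(abs(coefs), decreasing = TRUE)], 10)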
Train Lasso Regression Model
To assess whether regularization improves performance, we also explore Lasso, Ridge, and Elastic Net. These approaches can reduce overfitting and handle correlated predictors more effectively than OLS. We compare them against the other linear models using the same repeated cross-validation framework.
lasso_model <- train(
  x = data$X,
  y = data$y,
  method = "lasso",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(lasso_model)
lasso: MAE=0.11, RMSE=0.14, R2=0.37
Best tuned parameters: fraction = 0.857894736842105
loess r-squared variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Bowl Setpoint` 59.594
`Filler Level` 51.518
`Usage cont` 49.516
`Pressure Setpoint` 46.216
`Carb Flow` 41.604
`Brand Code`C 39.243
`Pressure Vacuum` 26.382
`Hyd Pressure3` 26.289
`Hyd Pressure2` 22.913
`Fill Pressure` 20.184
`Brand Code`D 16.949
`Oxygen Filler` 14.012
`Carb Rel` 12.635
Temperature 11.965
`Hyd Pressure1` 11.618
`Alch Rel` 10.601
`Hyd Pressure4` 10.396
`Brand Code`A 4.905
MFR 4.545
Train Ridge Regression Model
ridge_model <- train(
  x = data$X,
  y = data$y,
  method = "ridge",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
report(ridge_model)
ridge: MAE=0.1, RMSE=0.13, R2=0.4
Best tuned parameters: lambda = 0.01
loess r-squared variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Bowl Setpoint` 59.594
`Filler Level` 51.518
`Usage cont` 49.516
`Pressure Setpoint` 46.216
`Carb Flow` 41.604
`Brand Code`C 39.243
`Pressure Vacuum` 26.382
`Hyd Pressure3` 26.289
`Hyd Pressure2` 22.913
`Fill Pressure` 20.184
`Brand Code`D 16.949
`Oxygen Filler` 14.012
`Carb Rel` 12.635
Temperature 11.965
`Hyd Pressure1` 11.618
`Alch Rel` 10.601
`Hyd Pressure4` 10.396
`Brand Code`A 4.905
MFR 4.545
Train Elastic Net Model
elastic_model <- train(
  x = data$X,
  y = data$y,
  method = "glmnet",
  tuneLength = 20,
  trControl = ctrl,
  preProcess = NULL
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(elastic_model)
glmnet: MAE=0.11, RMSE=0.14, R2=NaN
Best tuned parameters: alpha = 0.952631578947368
Best tuned parameters: lambda = 0.000335359611608049
glmnet variable importance
only 20 most important variables shown (out of 34)
Overall
`Mnf Flow` 100.000
`Hyd Pressure3` 62.568
`Brand Code`C 50.353
`Bowl Setpoint` 49.733
`Carb Pressure1` 37.185
Density 36.771
`Alch Rel` 32.715
`Usage cont` 24.122
Temperature 21.838
`Pressure Setpoint` 19.312
`Hyd Pressure2` 16.036
`Oxygen Filler` 15.373
`Brand Code`A 14.679
`Filler Level` 13.963
`Carb Rel` 10.869
`Carb Flow` 10.216
`Carb Volume` 9.254
`Fill Pressure` 7.366
`Fill Ounces` 6.815
`Hyd Pressure1` 6.218
We see a warning after training the Elastic Net model that there were missing values in the resampled performance measures. The model finished training and produced valid results, and, as we will see below, its metrics are close to the others’, so we will not recommend it as our final model unless it offers clear advantages (which, as we will see, it does not).
Comparison of Linear Models
After training and comparing five linear modeling approaches — OLS, PLS, Lasso, Ridge, and Elastic Net — we find that performance across all methods is remarkably similar. Each model was evaluated using repeated 10-fold cross-validation, and results were compared using RMSE (Root Mean Squared Error), R-squared, and MAE (Mean Absolute Error).
The best RMSE and best R-squared values were achieved by OLS and Ridge, but the differences between all models are minuscule (a few thousandths of an RMSE unit at most). The lowest MAE was from OLS at 0.104241, but again, the margin is extremely small. (A more rigorous per-fold comparison is sketched at the end of this subsection.)
Given the similarity in performance, we recommend Ordinary Least Squares (OLS) for this use case:
- It is simple, interpretable, and fast to compute.
- It slightly outperformed PLS and Lasso in both MAE and RMSE.
- Its coefficients can be used directly for business insight and explainability.
Elastic Net showed nearly identical performance but triggered a warning about missing resampled performance measures, suggesting mild instability. If regularization becomes important due to changing data conditions or overfitting concerns in future phases, Lasso or Elastic Net could be revisited.
This modeling exercise confirms that pH can be predicted with consistent accuracy using standard linear models. The choice of model does not significantly alter accuracy, allowing the business to prioritize simplicity and transparency. Implementing the OLS model enables stakeholders to:
- Understand which production variables most influence pH
- Monitor predictions in real time using a straightforward model
- Translate coefficients into actionable guidance for technicians
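For a more rigorous head-to-head than averaging over each tuning grid, caret’s resamples() collects per-fold metrics at each model’s best tune. A sketch using the models trained above (this assumes the resampling folds line up across models, which our shared seed aims to ensure):

# Sketch: compare per-fold performance of the linear models at their best tunes.
linear_resamples <- resamples(list(
  OLS = ols_model,
  PLS = pls_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model
))
summary(linear_resamples)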
Non Linear Regression
Refined Preprocessing
data <- preprocess(train_data_pp, c("center", "scale", "medianImpute", "pca"))
KNN Model
k-Nearest Neighbors is a non-parametric method used for classification and regression. It works by finding the k-nearest neighbors to a given point and using their values to predict the value of that point.
knn_model <- train(
  x = data$X,
  y = data$y,
  method = "knn",
  tuneLength = 20,
  trControl = ctrl
)
report(knn_model)
knn: MAE=0.1, RMSE=0.13, R2=0.45
Best tuned parameters: k = 9
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
SVM Model
Support Vector Machines (SVM) is a supervised learning algorithm that can be used for classification or regression. For classification it finds the hyperplane that best separates the classes; for regression it fits a function that keeps most points within a margin of tolerance. We will try three different SVM kernels: linear, polynomial, and radial.
Linear SVM Model
svml_model <- train(
  x = data$X,
  y = data$y,
  method = "svmLinear",
  tuneLength = 10,
  trControl = ctrl
)
report(svml_model)
svmLinear: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: C = 1
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
Polynomial SVM Model
svmp_model <- train(
  x = data$X,
  y = data$y,
  method = "svmPoly",
  tuneLength = 3,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(svmp_model)
svmPoly: MAE=0.11, RMSE=0.15, R2=0.34
Best tuned parameters: degree = 3
Best tuned parameters: scale = 0.01
Best tuned parameters: C = 1
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
Radial SVM Model
svmr_model <- train(
  x = data$X,
  y = data$y,
  method = "svmRadial",
  tuneLength = 5,
  trControl = ctrl
)
report(svmr_model)
svmRadial: MAE=0.09, RMSE=0.12, R2=0.49
Best tuned parameters: sigma = 0.0322972739788174
Best tuned parameters: C = 4
loess r-squared variable importance
Overall
PC2 100.0000
PC1 32.9145
PC6 30.1517
PC7 27.3098
PC17 16.6429
PC4 13.9365
PC8 13.1987
PC20 11.9676
PC3 9.8122
PC15 8.7371
PC5 8.3752
PC18 7.8995
PC9 7.2142
PC16 7.1215
PC19 6.2098
PC12 3.1794
PC13 2.5477
PC10 1.5031
PC11 0.5908
PC14 0.0000
MARS Model
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that can be used for both linear and non-linear regression. It works by fitting piece-wise linear functions to the data.
We are using a less intensive cross validation method for MARS because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid <- expand.grid(
  degree = 1:2,
  nprune = 5:10
)
mars_model <- train(
  x = data$X,
  y = data$y,
  method = "earth",
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(mars_model)
earth: MAE=0.11, RMSE=0.14, R2=0.3
Best tuned parameters: nprune = 10
Best tuned parameters: degree = 2
earth variable importance
Overall
PC2 100.000
PC1 71.570
PC6 71.570
PC7 52.247
PC17 28.489
PC18 24.062
PC13 16.917
PC11 9.624
PC15 9.624
PC16 0.000
Neural Network Model
Neural Networks are a class of models that are inspired by the way the human brain works. They are used for both classification and regression tasks. In this case, we are using it for regression.
As with MARS, we are using a less intensive cross validation method for Neural Networks because it takes a long time to run with our default parameters. We are cutting down on cross validation to 2 folds and 5 repeats.
grid <- expand.grid(
  size = c(2, 4, 6, 8, 10),
  decay = c(0, 0.05, 0.1, 0.15)
)
nn_model <- train(
  x = data$X,
  y = data$y,
  method = "nnet",
  linout = TRUE,
  trace = FALSE,
  maxit = 1000,
  tuneGrid = grid,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(nn_model)
nnet: MAE=0.11, RMSE=0.17, R2=0.38
Best tuned parameters: size = 6
Best tuned parameters: decay = 0.1
nnet variable importance
Overall
PC20 100.00
PC6 97.17
PC2 93.23
PC19 82.67
PC7 82.65
PC16 80.56
PC17 68.44
PC4 68.33
PC1 58.24
PC18 54.43
PC10 44.01
PC13 34.40
PC15 33.81
PC8 32.10
PC5 30.84
PC3 28.67
PC11 28.28
PC14 18.44
PC12 10.78
PC9 0.00
Non Linear Summary
Of the non-linear models, the radial SVM performed best with an R-squared of 0.485, followed closely by KNN at 0.455. Somewhat surprisingly, the MARS model had the lowest R-squared of the nonlinear bunch at 0.302, while the Neural Network came in at 0.384.
The MAE and RMSE, for the most part, reinforce R-squared’s story. The nonlinear champion, the radial SVM, had the lowest MAE (0.093) and lowest RMSE (0.125), while at the other extreme MARS had the highest MAE (0.114) and one of the highest RMSEs (0.144).
Decision Trees
Refined Preprocessing
rpart_data <- preprocess(train_data_pp, "medianImpute")
# column names with backticks are causing issues with the caret package.
rpart_data$X <- clean_names(rpart_data$X)
Recursive Partitioning Model
rpart_model <- train(
  x = rpart_data$X,
  y = rpart_data$y,
  method = "rpart",
  tuneLength = 10,
  trControl = ctrl
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
report(rpart_model)
rpart: MAE=0.11, RMSE=0.14, R2=0.34
Best tuned parameters: cp = 0.0136301352354267
rpart variable importance
only 20 most important variables shown (out of 34)
Overall
mnf_flow 100.000
temperature 88.758
usage_cont 80.417
filler_speed 76.020
balling_lvl 69.764
fill_pressure 51.009
bowl_setpoint 48.788
hyd_pressure3 48.601
filler_level 45.544
air_pressurer 45.273
pressure_vacuum 41.667
brand_code_c 40.155
carb_flow 36.453
pressure_setpoint 32.795
carb_pressure1 32.761
carb_rel 29.118
oxygen_filler 28.459
alch_rel 20.989
brand_code_d 18.941
carb_volume 7.517
Visualize the model
rpart.plot(
  rpart_model$finalModel,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  main = "Decision Tree for pH Prediction"
)
The recursive partitioning model was built using the rpart method with 10 candidate complexity parameter (cp) values. The optimal cp found was 0.0136, balancing tree complexity and performance. While the decision tree offers good interpretability, its predictive power was moderate: an R-squared of 0.34 indicates only limited explanatory capability for the variance in the response variable. The visualization of the decision tree helps show how key variables influence the pH prediction.
The top three predictors identified were mnf_flow (100% importance), temperature (88.76), and usage_cont (80.42).
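Where interpretability matters, the fitted tree can also be rendered as plain-language rules, which can be easier to hand to process engineers than the plot. A sketch using rpart.plot’s rule printer (not part of our pipeline):

# Sketch: print the fitted tree as human-readable decision rules.
rpart.rules(rpart_model$finalModel, roundint = FALSE)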
Random Forest Model
rf_data <- rpart_data

# Random forest takes a very long time to run with our default parameters.
# We are cutting down on cross validation.
rf_model <- train(
  x = rf_data$X,
  y = rf_data$y,
  method = "rf",  # caret's default when no method is given; made explicit here
  importance = TRUE,
  trControl = trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5
  )
)
report(rf_model)
rf: MAE=0.08, RMSE=0.11, R2=0.63
Best tuned parameters: mtry = 18
rf variable importance
only 20 most important variables shown (out of 34)
Overall
mnf_flow 100.00
brand_code_c 92.93
pressure_vacuum 88.75
oxygen_filler 77.26
balling_lvl 65.47
air_pressurer 61.98
temperature 59.73
alch_rel 58.12
carb_rel 58.11
usage_cont 54.06
density 52.87
hyd_pressure3 51.18
filler_speed 48.47
carb_flow 45.49
carb_pressure1 44.49
hyd_pressure1 43.02
bowl_setpoint 40.53
hyd_pressure2 38.17
fill_pressure 31.83
filler_level 30.07
Visualize the model
varImpPlot(
  rf_model$finalModel,
  type = 1,
  main = "Random Forest Variable Importance"
)
The Random Forest outperformed the decision tree, nearly doubling the explanatory power (Rsquared of 0.63). It provided much stronger predictive performance, suggesting it is better suited for this dataset despite longer training time and reduced interpretability.
Top predictors included mnf_flow (100% importance), consistent with the decision tree, followed by brand_code_c (92.93) and pressure_vacuum (88.75).
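To see how the champion’s top predictor drives predictions, here is a partial-dependence sketch using randomForest’s partialPlot() (the column name assumes the cleaned names used above; this can be slow on the full training set):

# Sketch: marginal effect of mnf_flow on predicted pH under the fitted forest.
partialPlot(
  rf_model$finalModel,
  pred.data = as.data.frame(rf_data$X),
  x.var = "mnf_flow",
  main = "Partial Dependence of pH on mnf_flow"
)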
Model Comparison
Having used cross-validation to build our models, each caret object now carries a results table of resampled performance statistics. Here we take those summary statistics and compare the models to one another.
model_list <- list(
  PLS = pls_model,
  OLS = ols_model,
  Lasso = lasso_model,
  Ridge = ridge_model,
  ElasticNet = elastic_model,
  KNN = knn_model,
  SVM_linear = svml_model,
  SVM_poly = svmp_model,
  SVM_radial = svmr_model,
  MARS = mars_model,
  NeuralNet = nn_model,
  RPart = rpart_model,
  RandomForest = rf_model
)

# extract the mean resampled metrics from each model
model_results <- tibble(
  Model = names(model_list),
  MAE = map(
    model_list, function(x) mean(x$results$MAE, na.rm = TRUE)
  ) |> as_vector(),
  RMSE = map(
    model_list, function(x) mean(x$results$RMSE, na.rm = TRUE)
  ) |> as_vector(),
  Rsquared = map(
    model_list, function(x) mean(x$results$Rsquared, na.rm = TRUE)
  ) |> as_vector()
)
model_results |>
  arrange(desc(Rsquared), MAE, RMSE) |>
  mutate(
    MAE = round(MAE, 3),
    RMSE = round(RMSE, 3),
    Rsquared = round(Rsquared, 3)
  ) |>
  kable()
Model | MAE | RMSE | Rsquared |
---|---|---|---|
RandomForest | 0.080 | 0.108 | 0.626 |
SVM_radial | 0.093 | 0.125 | 0.485 |
KNN | 0.099 | 0.128 | 0.455 |
OLS | 0.104 | 0.134 | 0.396 |
Ridge | 0.104 | 0.134 | 0.396 |
NeuralNet | 0.108 | 0.172 | 0.381 |
PLS | 0.106 | 0.136 | 0.380 |
ElasticNet | 0.109 | 0.139 | 0.370 |
Lasso | 0.108 | 0.138 | 0.370 |
RPart | 0.111 | 0.141 | 0.340 |
SVM_poly | 0.113 | 0.148 | 0.338 |
SVM_linear | 0.110 | 0.141 | 0.337 |
MARS | 0.114 | 0.144 | 0.302 |
The Champion
A champion, ideally, is a model with the lowest MAE, the lowest RMSE, and the highest R2. And we have such a champion. Please allow us to present the champion of our modeling competition: Random Forest. It is only moderately interpretable and not very fast to compute, but its accuracy is unmatched here.
Among all tested models, the Random Forest model did demonstrate the best overall performance, achieving the lowest MAE (0.080), lowest RMSE (0.108), and highest r squared (0.626), indicating superior predictive accuracy and explanatory power.
The SVM radial and KNN models followed, with moderate performance (r squared of 0.485 and 0.455, respectively), but did not match the precision of Random Forest. Linear models such as OLS, Ridge, and Elastic Net showed similar results, with r squared values clustered around 0.37–0.40.
The Neural Network, PLS, Lasso, and RPart (Decision Tree) models exhibited slightly weaker performance, with r squared values below 0.40. The MARS model had the lowest performance overall (r squared of 0.302).
In conclusion, Random Forest is the recommended model for this dataset due to its balance of accuracy and robustness, despite being more complex and computationally intensive.
Final Predictions
Finally, we will make predictions using our champion model, Random Forest. We will use the test data to make predictions and save them to a CSV file.
# preprocess the test data with the same model we used to preprocess the training data
df <- predict(rf_data$PPM, test_data_pp)
# the variable names in our training set caused problems with `randomForest`, so we
# "cleaned" them. The model is expecting the same variables.
df <- clean_names(df)
data.frame(
  SampleID = 1:nrow(df),
  Predicted_pH = predict(rf_model, newdata = df)
) |>
  write.csv(
    "Final_Predictions.csv",
    row.names = FALSE
  )
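As a final sanity check (a sketch, not part of the deliverable), the saved predictions should fall within the pH range observed in training (roughly 7.88 to 9.36):

# Sketch: confirm the predicted pH values sit in a plausible range.
summary(read.csv("Final_Predictions.csv")$Predicted_pH)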