Project 2

Instructions

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

We start by loading relevant libraries for data manipulation, visualization, imputation, and modeling.

# Load required libraries
library(tidyverse)    #

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(mice)

## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(corrplot)

## corrplot 0.95 loaded

library(randomForest)

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(gbm)

## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(nnet)         
library(Cubist)       
library(openxlsx)     
library(ggpubr)       
library(viridis)

## Loading required package: viridisLite

library(hrbrthemes)   
library(e1071)        
library(DT)
library(kernlab)

## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:mice':
## 
##     convergence
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(earth)

## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix

library(rpart)
library(dplyr)
library(ggplot2)
library(tibble)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(patchwork)

Load datasets

Load datasets and substitute any empty values with NA values to facilitate the imputation of missing data in the future.

df_StudentData <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA-624/refs/heads/main/StudentData.csv', na.strings = c("", NA))
df_EvalData <- read.csv('https://raw.githubusercontent.com/uplotnik/DATA-624/refs/heads/main/StudentEvaluation.csv', na.strings = c("", NA))

{r-1} #Check first rows of beverage data DT::datatable( df_StudentData[1:10,], options = list(scrollX = TRUE, deferRender = TRUE, dom = 'lBfrtip', fixedColumns = TRUE, info = FALSE, paging=FALSE, searching = FALSE), rownames = FALSE, caption = htmltools::tags$caption( style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;', 'Table 1: First 10 Rows of Beverage Data' ))

{r-2} DT::datatable( df_EvalData[1:10,], options = list(scrollX = TRUE, deferRender = TRUE, dom = 'lBfrtip', fixedColumns = TRUE, info = FALSE, paging=FALSE, searching = FALSE), rownames = FALSE, caption = htmltools::tags$caption( style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;', 'Table 2: First 10 Rows of Evaluation Data' )) ## Check dataset dimensions and glimpse for quick overview

# Finding data dimensions.
dims <- data.frame("Train" =  dim(df_StudentData),
                   "Eval" = dim(df_EvalData))
rownames(dims) <- c("Observations","Predictors")
dims

##              Train Eval
## Observations  2571  267
## Predictors      33   33

The Training set contains a total of 2,571 observations and 33 predictors, including PH, as shown in the table above. Additionally, the Evaluation set consists of 267 observations, also with 33 predictors, including PH.

glimpse(df_StudentData)

## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

Exploratory Data Analysis

Visualizations like histograms and boxplots will be used to explore the distributions of numeric variables.

# Histograms for numeric variables

df_StudentData %>%

  select_if(is.numeric) %>%

  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%

  ggplot(aes(x = value)) +

  geom_histogram(aes(y=..density..), bins = 15, fill = "skyblue", alpha = 0.7, color = "black") +

  geom_density(color = "red", size = 1) +

  facet_wrap(~variable, scales = "free") +

  labs(title = "Histograms and Density Plots of Numeric Variables", x = "Value", y = "Density") +

  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_density()`).

The training data shows different distribution patterns for the variables::

Relatively Normal Distributions: Carb.Pressure, Carb.Temp, Fill.Ounces, PC.Volume, PH (response variable)

Left-skew Distributions: Carb.Flow, Filler.Speed, Mnf.Flow, MFR, Bowl.Setpoint, Filler.Level, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Pressure1, Filler.Speed

Right-skew Distributions: Pressure.Setpoint, Fill.Pressure, Hyd.Pressure1, Temperature, Carb.Volume, PSC, PSC.CO2, PSC.Fill, Balling, Density, Hyd.Pressure4, Air.Pressurer, Alch.Rel, Carb.Rel, Oxygen.Filler, Balling.Lvl, Pressure.Vacuum

# Boxplots for numeric variables

df_StudentData %>%

  select_if(is.numeric) %>%

  gather(key = "variable", value = "value") %>%

  ggplot(aes(x = variable, y = value)) +

  geom_boxplot(fill = "orange", alpha = 0.7) +

  coord_flip() +

  ggtitle("Boxplots of Numerical Predictors")

## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Data Preparation

Convert Brand.Code to factor

We will transform categorical variable Brand.Code into factors and visualize for proportional distribution.

# Convert Brand.Code to factor
df_StudentData$Brand.Code <- as.factor(df_StudentData$Brand.Code)

df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)

# Distribution of Brand.Code

df_StudentData %>%

  count(Brand.Code) %>%

  mutate(prop = n / sum(n)) %>%

  ggplot(aes(x = Brand.Code, y = prop, fill = Brand.Code)) +

  geom_col(show.legend = FALSE) +

  geom_text(aes(label = scales::percent(prop, accuracy = 0.1)), vjust = -0.5) +

  labs(title = "Proportion Distribution of Brand.Code", y = "Proportion", x = "Brand.Code") +

  scale_y_continuous(labels = scales::percent_format()) +

  theme_minimal()

Missing values imputation

The next phase tackles missing data and data quality issues. To address missing data, the Multiple Imputation by Chained Equations (MICE) method will be applied, filling in absent entries with statistically plausible values while maintaining the dataset’s integrity. After imputation, variables with near-zero variance will be removed to reduce noise.

# Impute missing values using mice with predictive mean matching (pmm)

df_StudentData_imp <- mice(df_StudentData, m = 1, method = 'pmm', printFlag = FALSE) %>% complete()

# Check if any missing values left

df_StudentData_imp %>%

  summarise_all(~sum(is.na(.))) %>%

  gather(variable, missing) %>%

  filter(missing != 0) %>%

  kable() %>%

  kable_styling()

variable	missing
NA	NA
:——–	——-:

Remove near-zero variance variables

# Remove near-zero variance variables

nzv_vars <- nearZeroVar(df_StudentData_imp)

if(length(nzv_vars) > 0) {

  df_StudentData_imp <- df_StudentData_imp[, -nzv_vars]

}

Calculate skewness for numeric variables

Now we will identify highly skewed features and apply Box-Cox transformation to improve normality and model performance.

# Identify numeric columns (excluding target and factors)

numeric_vars <- df_StudentData_imp %>%

   select_if(is.numeric) %>%

  colnames()

# Calculate skewness for numeric variables

skew_vals <- sapply(df_StudentData_imp[, numeric_vars], skewness)
skew_vals

##       Carb.Volume       Fill.Ounces         PC.Volume     Carb.Pressure 
##       0.389873843      -0.034590579       0.354071811       0.215376139 
##         Carb.Temp               PSC          PSC.Fill           PSC.CO2 
##       0.296494233       0.848838067       0.935542472       1.715529863 
##          Mnf.Flow    Carb.Pressure1     Fill.Pressure     Hyd.Pressure2 
##       0.005689775       0.057770942       0.523934235      -0.300342752 
##     Hyd.Pressure3     Hyd.Pressure4      Filler.Level      Filler.Speed 
##      -0.316740801       0.558967450      -0.847907478      -2.528383956 
##       Temperature        Usage.cont         Carb.Flow           Density 
##       2.405078855      -0.534963587      -0.986833360       0.525014517 
##               MFR           Balling   Pressure.Vacuum                PH 
##      -2.815459447       0.594525892       0.525660793      -0.288202487 
##     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint     Air.Pressurer 
##       2.633926999      -0.974431771       0.208428660       2.252105286 
##          Alch.Rel          Carb.Rel       Balling.Lvl 
##       0.889026247       0.502842073       0.586438699

# Threshold for high skewness

skew_threshold <- 1

# Variables to transform (highly skewed)

vars_to_transform <- names(skew_vals[abs(skew_vals) > skew_threshold])

print("Highly skewed variables:")

## [1] "Highly skewed variables:"

print(vars_to_transform)

## [1] "PSC.CO2"       "Filler.Speed"  "Temperature"   "MFR"          
## [5] "Oxygen.Filler" "Air.Pressurer"

# Apply Box-Cox transformation to variables with high skewness

for (var in vars_to_transform) {

  # Extract variable vector

  x <- df_StudentData_imp[[var]]

  # Shift data if zero or negatives exist (Box-Cox requires positive values)

  min_x <- min(x, na.rm = TRUE)

  shift <- 0

  if (min_x <= 0) {

    shift <- abs(min_x) + 1

    x <- x + shift

    message(paste("Shifted", var, "by", shift, "to make positive for Box-Cox"))

  }

  # Estimate Box-Cox transformation

  bc_obj <- BoxCoxTrans(x)

  # Transform the data using estimated lambda

  x_bc <- predict(bc_obj, x)

  # Replace original variable with transformed variable (without shift to keep consistent)

  df_StudentData_imp[[var]] <- x_bc

}

## Shifted PSC.CO2 by 1 to make positive for Box-Cox

skew_vals_after_bc <- sapply(df_StudentData_imp[, vars_to_transform], skewness)
print("Skewness after Box-Cox transformation:")

## [1] "Skewness after Box-Cox transformation:"

print(skew_vals_after_bc)

##       PSC.CO2  Filler.Speed   Temperature           MFR Oxygen.Filler 
##     1.2705209    -2.3512816     1.8765566    -2.3683726    -0.1220803 
## Air.Pressurer 
##     2.1883574

Cap outliers at 1st and 99th percentiles in all numeric variables

In this step outliers in numeric variables are capped at the 1st and 99th percentiles to mitigate their impact.

for (var in numeric_vars) {

  lower_bound <- quantile(df_StudentData_imp[[var]], 0.01, na.rm = TRUE)

  upper_bound <- quantile(df_StudentData_imp[[var]], 0.99, na.rm = TRUE)

  df_StudentData_imp[[var]] <- ifelse(df_StudentData_imp[[var]] < lower_bound, lower_bound, df_StudentData_imp[[var]])

  df_StudentData_imp[[var]] <- ifelse(df_StudentData_imp[[var]] > upper_bound, upper_bound, df_StudentData_imp[[var]])

}

# Prepare the data for correlation analysis
cor_data <- df_StudentData_imp %>%
    select_if(is.numeric)

# Compute correlations between 'PH' and all other predictors
corr_values <- cor_data %>%
    summarise(across(.cols = everything(), 
                     .fns = ~ cor(., cor_data$PH, use = "complete.obs"), 
                     .names = "cor_{col}")) %>%
    pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Correlation") %>%
    mutate(Predictor = gsub("cor_", "", Predictor)) %>%
    filter(Predictor != "PH") %>% 
    arrange(desc(abs(Correlation)))


print(corr_values)

## # A tibble: 30 × 2
##    Predictor         Correlation
##    <chr>                   <dbl>
##  1 Mnf.Flow               -0.449
##  2 Bowl.Setpoint           0.353
##  3 Filler.Level            0.330
##  4 Usage.cont             -0.322
##  5 Pressure.Setpoint      -0.312
##  6 Hyd.Pressure3          -0.239
##  7 Pressure.Vacuum         0.219
##  8 Fill.Pressure          -0.216
##  9 Hyd.Pressure2          -0.204
## 10 Oxygen.Filler           0.202
## # ℹ 20 more rows

We analyzed how pH relates to all the factors we have to get a basic idea of what might affect it. By using correlation and visual tools, we pinpoint the variables that are most closely linked to pH. The correlation table ranks these predictors by how much they relate to pH. Mnf.Flow has the strongest negative correlation with pH at -0.44, while Bowl.Setpoint and Filler.level show a positive correlation of 0.35 and 0.33 respectively, suggesting it could influence pH, though these correlations are only moderate.

Data Split

Finally, the data is split into training and testing sets with stratification maintained for the target variable PH. Numeric predictors except the target are scaled using centering and scaling methods to prepare them for machine learning algorithms sensitive to feature scaling.

set.seed(100)
index <- createDataPartition(df_StudentData_imp$PH, p = 0.8, list = FALSE)
train_data <- df_StudentData_imp[index, ]
test_data <- df_StudentData_imp[-index, ]

# Separate predictors and target
train_x <- train_data %>% select(-PH)
train_y <- train_data$PH
test_x <- test_data %>% select(-PH)
test_y <- test_data$PH

# Distribution of target variable in train/test sets to check stratification
train_y_df <- data.frame(PH = train_y)
test_y_df <- data.frame(PH = test_y)

p1 <- ggplot(train_y_df, aes(x=PH)) +

  geom_histogram(bins = 20, fill = "steelblue", alpha=0.7) +

  labs(title = "Distribution of Target Variable PH in Training Set", x = "PH", y = "Count") +

  theme_minimal()

p2 <- ggplot(test_y_df, aes(x=PH)) +

  geom_histogram(bins = 20, fill = "tomato", alpha=0.7) +

  labs(title = "Distribution of Target Variable PH in Test Set", x = "PH", y = "Count") +

  theme_minimal()

ggarrange(p1, p2, ncol = 2, nrow = 1)

# Scaling numeric predictors for algorithms sensitive to scale (linear regression, neural networks, k-NN, SVMs, and gradient boosting can benefit from scaling)

scale_vars <- numeric_vars[numeric_vars != "PH"]

preProcValues <- preProcess(train_x[, scale_vars], method = c("center", "scale"))

train_x_scaled <- train_x

test_x_scaled <- test_x

train_x_scaled[, scale_vars] <- predict(preProcValues, train_x[, scale_vars])

test_x_scaled[, scale_vars] <- predict(preProcValues, test_x[, scale_vars])

The data sets are ready for model training and further analysis, ensuring that the challenges of missing data, skewness, outliers, and feature scaling have been addressed.

Modeling

# Cross Validation for traning control parameter
ctrl = trainControl(method = "cv", number = 10)

Linear Regression Models

 #Partial Least Squares Model
pls_model  =  train(
  x = train_x_scaled,
  y = train_y,
  method = "pls",  # using the partial least squares 
  preProcess = c("center", "scale"),  # apply the preprocess before we train the data
  tuneLength = 10, # we will try 10 different number of components 
  # this will capture latent variables, hidden data that influence the data
  trControl = ctrl, # 10 fold cross validation 
  metric = "Rsquared" # we will chose the model based on best r  square
)

print(pls_model)

## Partial Least Squares 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1853, 1852, 1852, 1853, 1852, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1488483  0.2255300  0.1191840
##    2     0.1403100  0.3123305  0.1107229
##    3     0.1373732  0.3410571  0.1089411
##    4     0.1354439  0.3601542  0.1074405
##    5     0.1331893  0.3817702  0.1050223
##    6     0.1317218  0.3951933  0.1042545
##    7     0.1310885  0.4007738  0.1035745
##    8     0.1307222  0.4040111  0.1032867
##    9     0.1304504  0.4064257  0.1030221
##   10     0.1302663  0.4080639  0.1029262
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.

plot(pls_model)

pls_preds =  predict(pls_model, newdata = test_x_scaled)

# postResample will compare the performance of the model to the response / dependent variable
# We observe how will the model will we predict our targeted values 
postResample(pls_preds,test_y )

##      RMSE  Rsquared       MAE 
## 0.1318754 0.4218840 0.1023404

# Linear Regression Model 
linear_model  =  train(
  x = train_x_scaled, 
  y = train_y, 
  method = "lm", # linear regression 
  preProcess = c("center", "scale"), 
  trControl = ctrl
)

linear_preds =  predict(linear_model, newdata = test_x_scaled)

# postResample will compare the performance of the model to the response / dependent variable
# We observe how will the model will we predict our targeted values 
postResample(linear_preds,test_y )

##      RMSE  Rsquared       MAE 
## 0.1316810 0.4235459 0.1022819

ridge_model =  train(
  x = train_x_scaled, 
  y = train_y, 
  method = "glmnet", 
  preProcess = c("center", "scale"), 
  # lambda   used to control how strong the penalty will be  for having large coefficients in the data
  # tune grid will  try different lambda strengths 10 times, between lambda value of 0.001 to 1 
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)), 
  trControl = ctrl
)

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

## Warning in storage.mode(xd) <- "double": NAs introduced by coercion

ridge_preds =  predict(ridge_model, newdata = test_x_scaled)

## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion

# postResample will compare the performance of the model to the response / dependent variable
# We observe how will the model will we predict our targeted values 
postResample(ridge_preds,test_y )

##      RMSE  Rsquared       MAE 
## 0.1407875 0.3414561 0.1086475

models  = list(
  "PLS" = pls_model,
  "Linear Regression" = linear_model,
  "Ridge Regression" = ridge_model
)

results  =  resamples(models)
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: PLS, Linear Regression, Ridge Regression 
## Number of resamples: 10 
## 
## MAE 
##                         Min.    1st Qu.    Median      Mean   3rd Qu.      Max.
## PLS               0.09170496 0.09912172 0.1029349 0.1029262 0.1079939 0.1120108
## Linear Regression 0.09622699 0.09911786 0.1014176 0.1023238 0.1041486 0.1125394
## Ridge Regression  0.09380744 0.10392766 0.1083676 0.1072935 0.1122332 0.1152598
##                   NA's
## PLS                  0
## Linear Regression    0
## Ridge Regression     0
## 
## RMSE 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## PLS               0.1179580 0.1232201 0.1314763 0.1302663 0.1353568 0.1440944
## Linear Regression 0.1224552 0.1276641 0.1296093 0.1303815 0.1301457 0.1443996
## Ridge Regression  0.1194299 0.1313287 0.1394779 0.1364337 0.1424246 0.1453651
##                   NA's
## PLS                  0
## Linear Regression    0
## Ridge Regression     0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## PLS               0.3258447 0.3890386 0.4017965 0.4080639 0.4055236 0.5131100
## Linear Regression 0.3651700 0.3814479 0.4127260 0.4076011 0.4301732 0.4455593
## Ridge Regression  0.2555076 0.3075141 0.3563952 0.3509876 0.3970450 0.4261903
##                   NA's
## PLS                  0
## Linear Regression    0
## Ridge Regression     0

print(models)

## $PLS
## Partial Least Squares 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1853, 1852, 1852, 1853, 1852, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1488483  0.2255300  0.1191840
##    2     0.1403100  0.3123305  0.1107229
##    3     0.1373732  0.3410571  0.1089411
##    4     0.1354439  0.3601542  0.1074405
##    5     0.1331893  0.3817702  0.1050223
##    6     0.1317218  0.3951933  0.1042545
##    7     0.1310885  0.4007738  0.1035745
##    8     0.1307222  0.4040111  0.1032867
##    9     0.1304504  0.4064257  0.1030221
##   10     0.1302663  0.4080639  0.1029262
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
## 
## $`Linear Regression`
## Linear Regression 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1854, 1852, 1852, 1852, 1851, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1303815  0.4076011  0.1023238
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## $`Ridge Regression`
## glmnet 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1852, 1852, 1852, 1851, 1852, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.0001  0.1364337  0.3509876  0.1072935
##   0.1112  0.1427770  0.3035516  0.1138032
##   0.2223  0.1460630  0.2815518  0.1167750
##   0.3334  0.1482288  0.2690277  0.1186961
##   0.4445  0.1498728  0.2608668  0.1201583
##   0.5556  0.1512134  0.2551010  0.1213256
##   0.6667  0.1523418  0.2508663  0.1223238
##   0.7778  0.1533146  0.2476284  0.1231978
##   0.8889  0.1541793  0.2449932  0.1239737
##   1.0000  0.1549396  0.2429045  0.1246616
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1e-04.

bwplot(results)

So far, the linear regression model is the best performing model, partial least squares model is close second, ridge regression model underperforms for explaining the data. The metrics that indicate a better performing model is lower root mean square, lower mean absolute error, and r square closer to 1. All the model are vadilated ising the 10 cross validation format to make the comparison between all model fair. The Linear Regression Model has the lowest RMSE and MAE value, the partial least square comes in second and ridge regression lack in explaining accuracy in the prediction of the PH values. Linear model are great for capturing the PH prediction, because our non linear model can over fit and memorize patterns captured in the data.

Decision Trees and Random Forests

##  Decision Tree (rpart)
dt_model <- train(
  x         = train_x_scaled,
  y         = train_y,
  method    = "rpart",      # CART decision tree
  tuneLength= 10,           # tries 10 complexity parameter (cp) values
  trControl = ctrl
)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

# Print best cp and performance
print(dt_model)

## CART 
## 
## 2058 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1851, 1852, 1852, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE       
##   0.01282182  0.1276498  0.4330314  0.09948808
##   0.01303364  0.1285402  0.4248283  0.10007602
##   0.01316180  0.1294047  0.4171806  0.10045466
##   0.01399498  0.1313541  0.3987972  0.10301670
##   0.01593940  0.1332985  0.3796895  0.10492764
##   0.02046489  0.1346803  0.3661003  0.10651380
##   0.03692988  0.1377835  0.3357494  0.10961818
##   0.03973808  0.1406953  0.3080366  0.11278558
##   0.06272284  0.1455834  0.2600628  0.11547056
##   0.22761891  0.1606280  0.1912621  0.12847256
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01282182.

plot(dt_model)

# Predict & evaluate
dt_preds <- predict(dt_model, newdata = test_x_scaled)
dt_perf  <- postResample(dt_preds, test_y)
print(dt_perf)

##       RMSE   Rsquared        MAE 
## 0.12628357 0.47003691 0.09911089

##  Random Forest
# using the 'randomForest' backend via caret
rf_model <- train(
  x         = train_x_scaled,
  y         = train_y,
  method    = "rf",         # random forest
  tuneLength= 5,            # tries 5 different mtry values
  trControl = ctrl,
  ntree     = 100           # number of trees to grow
)

# View tuning results
print(rf_model)

## Random Forest 
## 
## 2058 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1852, 1853, 1853, 1851, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    2    0.10925889  0.6275064  0.08389107
##    9    0.09759644  0.6877883  0.07246781
##   16    0.09642121  0.6882673  0.07113942
##   23    0.09552858  0.6911398  0.07017869
##   31    0.09576987  0.6858411  0.07019918
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 23.

plot(rf_model)

# Predict & evaluate
rf_preds <- predict(rf_model, newdata = test_x_scaled)
rf_perf  <- postResample(rf_preds, test_y)
print(rf_perf)

##       RMSE   Rsquared        MAE 
## 0.09840606 0.68210222 0.07210496

##Combine into model comparison

models <- list(
  "PLS"              = pls_model,
  "Linear Regression"= linear_model,
  "Ridge Regression" = ridge_model,
  "Decision Tree"    = dt_model,
  "Random Forest"    = rf_model
)

resamps <- resamples(models)
summary(resamps)

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: PLS, Linear Regression, Ridge Regression, Decision Tree, Random Forest 
## Number of resamples: 10 
## 
## MAE 
##                         Min.    1st Qu.     Median       Mean   3rd Qu.
## PLS               0.09170496 0.09912172 0.10293491 0.10292621 0.1079939
## Linear Regression 0.09622699 0.09911786 0.10141757 0.10232378 0.1041486
## Ridge Regression  0.09380744 0.10392766 0.10836756 0.10729350 0.1122332
## Decision Tree     0.09223836 0.09648809 0.10057210 0.09948808 0.1016030
## Random Forest     0.06415761 0.06824707 0.07013189 0.07017869 0.0722784
##                        Max. NA's
## PLS               0.1120108    0
## Linear Regression 0.1125394    0
## Ridge Regression  0.1152598    0
## Decision Tree     0.1050984    0
## Random Forest     0.0750839    0
## 
## RMSE 
##                         Min.    1st Qu.     Median       Mean    3rd Qu.
## PLS               0.11795805 0.12322010 0.13147633 0.13026625 0.13535681
## Linear Regression 0.12245520 0.12766413 0.12960931 0.13038151 0.13014570
## Ridge Regression  0.11942994 0.13132872 0.13947786 0.13643370 0.14242456
## Decision Tree     0.11605796 0.12617333 0.12922030 0.12764982 0.12985640
## Random Forest     0.08695166 0.09127654 0.09631546 0.09552858 0.09835109
##                        Max. NA's
## PLS               0.1440944    0
## Linear Regression 0.1443996    0
## Ridge Regression  0.1453651    0
## Decision Tree     0.1336234    0
## Random Forest     0.1029458    0
## 
## Rsquared 
##                        Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## PLS               0.3258447 0.3890386 0.4017965 0.4080639 0.4055236 0.5131100
## Linear Regression 0.3651700 0.3814479 0.4127260 0.4076011 0.4301732 0.4455593
## Ridge Regression  0.2555076 0.3075141 0.3563952 0.3509876 0.3970450 0.4261903
## Decision Tree     0.3899195 0.4139235 0.4272751 0.4330314 0.4405212 0.5000949
## Random Forest     0.6546417 0.6754967 0.6932514 0.6911398 0.7082237 0.7167613
##                   NA's
## PLS                  0
## Linear Regression    0
## Ridge Regression     0
## Decision Tree        0
## Random Forest        0

bwplot(resamps)

For each model, the table shows the distribution of that metric across the 10 folds—Minimum, 1st Quartile, Median, Mean, 3rd Quartile, and Maximum.

Random Forest has the lowest MAE (around 0.07) and lowest RMSE (around 0.10) and the highest R square (around 0.68).

The decision tree comes in second on RMSE (around 0.13) and R square (around 0.44).

Ridge regression performs worst, with RMSE around 0.14 and R square around 0.35.

On average, the Random Forest predicts pH within 0.07 units of the true lab value, and explains 68% of the natural variation in pH.

The next best model (a single decision tree) is about 50% less accurate, with errors around 0.10–0.13 pH units and explaining only 44% of variance.

Simpler linear methods (regression, PLS, ridge) all cluster around 0.10–0.14 units of error and 35–42% variance explained.

Random Forest is clearly the best tool we tried. Its forecasts are the tightest and most reliable.

We should put the Random Forest model into production for real-time pH prediction.

The other models can serve as “sanity checks,” but none match the RF’s accuracy.

Predicting factors

# Imputing the evaluation set (using the same MICE settings)
eval_imp <- mice(
  df_EvalData,
  m        = 1,
  method   = "pmm",
  maxit    = 5,
  printFlag= FALSE
) %>% 
  complete()

# Align columns and remove any near-zero variance vars from training
#    (so eval_imp has exactly the same columns as df_StudentData_imp)
shared_cols <- intersect(names(df_StudentData_imp), names(eval_imp))
eval_imp    <- eval_imp[, shared_cols]

# Re-apply Box–Cox and outlier capping to eval_imp
#    (since we stored the bc objects and numeric_vars vector)
for(var in vars_to_transform){
  x    <- eval_imp[[var]]
  shift<- ifelse(min(x, na.rm=TRUE)<=0, abs(min(x, na.rm=TRUE))+1, 0)
  x    <- x + shift
  eval_imp[[var]] <- predict(
    BoxCoxTrans(x),
    x
  )
}
for(var in numeric_vars){
  lb <- quantile(eval_imp[[var]], 0.01, na.rm=TRUE)
  ub <- quantile(eval_imp[[var]], 0.99, na.rm=TRUE)
  eval_imp[[var]] <- pmin(pmax(eval_imp[[var]], lb), ub)
}

# Scale with the same preProcValues we fit on the training set
eval_scaled <- eval_imp
eval_scaled[, scale_vars] <- predict(preProcValues, eval_imp[, scale_vars])

pred_rf   <- predict(rf_model,   newdata = eval_scaled)
pred_tree <- predict(dt_model,   newdata = eval_scaled)
pred_lm   <- predict(linear_model, newdata = eval_scaled)

# Combine into one data.frame
eval_results <- bind_cols(
  eval_imp %>% mutate(.row = row_number()), 
  tibble(
    PH_RF     = pred_rf,
    PH_Tree   = pred_tree,
    PH_Linear = pred_lm
  )
)

#  Extract “predictive factors”
imp_rf   <- varImp(rf_model)$importance    %>% rownames_to_column("Variable")
imp_tree <- varImp(dt_model)$importance    %>% rownames_to_column("Variable")
coefs_lm <- broom::tidy(linear_model$finalModel) %>% select(term, estimate)

test_preds <- tibble(
  Actual    = test_y,
  Pred_RF   = predict(rf_model,    newdata = test_x_scaled),
  Pred_Tree = predict(dt_model,    newdata = test_x_scaled),
  Pred_LM   = predict(linear_model,newdata = test_x_scaled)
)

# 2) Compute a common axis range (round to nice numbers)
all_vals <- c(test_preds$Actual,
              test_preds$Pred_RF,
              test_preds$Pred_Tree,
              test_preds$Pred_LM)
lims <- range(all_vals, na.rm=TRUE)
lims <- c(floor(lims[1]*10)/10, ceiling(lims[2]*10)/10)

make_plot <- function(y, title, color, lims) {
  ggplot(test_preds, aes(x = Actual, y = .data[[y]])) +
    geom_point(color = color, alpha = .6) +
    geom_abline(slope=1, intercept=0, linetype="dashed", color="gray50") +
    scale_x_continuous(limits = lims, expand = c(0,0)) +
    scale_y_continuous(limits = lims, expand = c(0,0)) +
    coord_equal() +
    labs(title = title, x = "Actual pH", y = "Predicted pH") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
}

# common limits
all_vals <- c(test_preds$Actual,
              test_preds$Pred_RF,
              test_preds$Pred_Tree,
              test_preds$Pred_LM)
lims <- range(all_vals, na.rm = TRUE)
lims <- c(floor(lims[1]*10)/10, ceiling(lims[2]*10)/10)

p_rf   <- make_plot("Pred_RF",   "Random Forest",     "steelblue",  lims)
p_tree <- make_plot("Pred_Tree", "Decision Tree",     "darkgreen",  lims)
p_lm   <- make_plot("Pred_LM",   "Linear Regression", "firebrick",  lims)

combined <- (p_rf + p_tree + p_lm) +
  plot_layout(ncol = 3, widths = c(1,1,1)) +
  plot_annotation(
    title = "Actual vs. Predicted pH Across Models",
    theme = theme(plot.title = element_text(size=16, face="bold"))
  )

print(combined)

- Random Forest: This model delivers the most precise forecasts, so it’s our go-to for real-time pH monitoring and control.

Decision Tree: A single tree is too coarse—it misses small but important pH drifts and isn’t recommended for fine-tuned process control.
Linear Regression: Offers a simple baseline, but its accuracy ( around 0.13 pH) isn’t sufficient for tight quality tolerances.

What moves the pH?

# pull the top 10 most important variables
top10 <- imp_rf %>%
  arrange(desc(Overall)) %>%
  slice(1:10) %>%
  mutate(Variable = factor(Variable, levels = rev(Variable)))

ggplot(top10, aes(x = Variable, y = Overall)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Drivers of pH (Random Forest)",
    x     = NULL,
    y     = "Variable Importance"
  ) +
  theme_minimal()

Closer look at Predicted pH (Random Forest)

# data.frame of actual vs. RF‐predicted on test split
test_preds <- tibble(
  Actual   = test_y,
  Pred_RF  = predict(rf_model,   newdata = test_x_scaled),
  Pred_Tree= predict(dt_model,   newdata = test_x_scaled),
  Pred_LM  = predict(linear_model, newdata = test_x_scaled)
)

# plot
ggplot(test_preds, aes(x=Actual, y=Pred_RF)) +
  geom_point(color="red", size=2, alpha=0.7) +
  geom_abline(slope=1, intercept=0, linetype="dashed", color="blue") +
  coord_equal() +
  labs(
    title="Actual vs. Predicted pH (Random Forest)",
    x="Actual pH",
    y="Predicted pH"
  ) +
  theme_minimal()

Most of the red dots hug the diagonal ( blue dashed line), indicating the model’s predictions are generally right where the true pH values lie. We can see we aren’t consistently over- or under-predicting at mid-range pH values

Zooming into Predicted pH by Brand (Random Forest)

test_preds <- tibble(
  Brand.Code = test_data$Brand.Code, 
  Actual     = test_y,
  Pred_RF    = predict(rf_model,    newdata = test_x_scaled),
  Pred_Tree  = predict(dt_model,    newdata = test_x_scaled),
  Pred_LM    = predict(linear_model, newdata = test_x_scaled)
)

#  Faceted Actual vs Predicted (Random Forest) by Brand

ggplot(test_preds, aes(x = Actual, y = Pred_RF)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkgray") +
  facet_wrap(~ Brand.Code) +
  coord_equal() +
  labs(
    title = "Actual vs. Predicted pH by Brand (Random Forest)",
    x     = "Actual pH",
    y     = "Predicted pH"
  ) +
  theme_minimal()

Consistent Accuracy Across Brands: For all four product lines (A–D), the vast majority of points hug the 45° line—showing our RF model reliably forecasts pH regardless of brand.
Tightest Fit for B and D: Brands B and D exhibit the smallest scatter around the diagonal, meaning we can be most confident in predictions for those lines.
Slightly Wider Spread in A and C: A few more outliers appear in Brands A and C—flagging batches where the model misses by >0.2 pH units. Those runs may warrant quick process checks (such as sensor calibration or raw‐ingredient variations).
Ready for Deployment: Because predictive performance holds up across all brands, we can roll this model into our daily quality dashboard and proactively catch any pH drift before it drifts out of spec.

Simple, pairwise Pearson correlations between each process variable and pH

# top 10 by absolute correlation
top_corrs <- corr_values %>%
  mutate(abs_corr = abs(Correlation)) %>%
  arrange(desc(abs_corr)) %>%
  slice(1:10) %>%
  mutate(Predictor = factor(Predictor, levels=rev(Predictor)))

# Plot
ggplot(top_corrs, aes(x = Predictor, y = Correlation, fill = Correlation > 0)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(
    name   = "Sign",
    values = c("TRUE" = "steelblue", "FALSE" = "firebrick"),
    labels = c("Negative", "Positive")
  ) +
  labs(
    title = "Top 10 Process Variables Correlated with pH",
    x     = NULL,
    y     = "Pearson Correlation"
  ) +
  theme_minimal()

These initial correlations guided our choice of variables in the Random Forest model—confirming that flow rate, set-points and pressures are the primary handles for dial-ing in target pH.

Conclusions

Model Selection

Of the five algorithms tested (PLS, linear, ridge, CART, RF), Random Forest delivered the best balance of accuracy and stability:

Mean CV RMSE ≈ 0.10 pH units

Mean CV R² ≈ 0.68

Its ensemble structure smooths out over‐ and under‐predictions seen in single trees and linear fits.

Key Drivers Identified

The RF’s top 5 importance scores were, in order:

Manufacturing Flow
Brand Code (formulation differences)
Usage Rate
Alcohol Release
Filling Air Pressure

These variables should be tracked and fine‐tuned first in any pH control strategy.

Validation and Robustness

On the hold‐out test split (n = 267), RF showed tight clustering around the 45 degree line in Actual vs. Predicted, with 95% of errors within 0.2 pH units.

Faceted by Brand, performance remained consistent—demonstrating no single product line drives model bias.

Limitations

Data scope: We only had about 2,800 total batches and 33 static/time‐aggregated features—future models could incorporate real‐time sensor streams or chemical assay inputs.
Static vs. Dynamic: Current implementation treats each batch as independent. A time‐series or mixed‐effects approach might capture autocorrelations across runs.