As the data science team at ABC Beverage, we are tasked to provide analysis of the company’s manufacturing process, the predictive factors and our predictive model of PH to meet compliance of the new regulations.

Exploratory Data Analysis

Summary Statistics

From the summary statistic, we can see that there is 1 categorical variable Brand.Code, our target variable PH, and 31 predictor variables. Additionally, we can see that many of these predictor variables and Brand.Code contain NAs. Notably, in Brand.Code, approximately 48% were categorized into B making the most frequent brand code.

Descriptive Stats Brand.Code
Brand Code Frequency Percent
120 4.67
A 293 11.40
B 1239 48.19
C 304 11.82
D 615 23.92
Descriptive Stats Numeric Predictors
Variable Count Mean Std Dev Median Min Max NA’s
Carb.Volume 2561 5.37 0.11 5.35 5.04 5.70 10
Fill.Ounces 2533 23.97 0.09 23.97 23.63 24.32 38
PC.Volume 2532 0.28 0.06 0.27 0.08 0.48 39
Carb.Pressure 2544 68.19 3.54 68.20 57.00 79.40 27
Carb.Temp 2545 141.09 4.04 140.80 128.60 154.00 26
PSC 2538 0.08 0.05 0.08 0.00 0.27 33
PSC.Fill 2548 0.20 0.12 0.18 0.00 0.62 23
PSC.CO2 2532 0.06 0.04 0.04 0.00 0.24 39
Mnf.Flow 2569 24.57 119.48 65.20 -100.20 229.40 2
Carb.Pressure1 2539 122.59 4.74 123.20 105.60 140.20 32
Fill.Pressure 2549 47.92 3.18 46.40 34.60 60.40 22
Hyd.Pressure1 2560 12.44 12.43 11.40 -0.80 58.00 11
Hyd.Pressure2 2556 20.96 16.39 28.60 0.00 59.40 15
Hyd.Pressure3 2556 20.46 15.98 27.60 -1.20 50.00 15
Hyd.Pressure4 2541 96.29 13.12 96.00 52.00 142.00 30
Filler.Level 2551 109.25 15.70 118.40 55.80 161.20 20
Filler.Speed 2514 3687.20 770.82 3982.00 998.00 4030.00 57
Temperature 2557 65.97 1.38 65.60 63.60 76.20 14
Usage.cont 2566 20.99 2.98 21.79 12.08 25.90 5
Carb.Flow 2569 2468.35 1073.70 3028.00 26.00 5104.00 2
Density 2570 1.17 0.38 0.98 0.24 1.92 1
MFR 2359 704.05 73.90 724.00 31.40 868.60 212
Balling 2570 2.20 0.93 1.65 -0.17 4.01 1
Pressure.Vacuum 2571 -5.22 0.57 -5.40 -6.60 -3.60 0
PH 2567 8.55 0.17 8.54 7.88 9.36 4
Oxygen.Filler 2559 0.05 0.05 0.03 0.00 0.40 12
Bowl.Setpoint 2569 109.33 15.30 120.00 70.00 140.00 2
Pressure.Setpoint 2559 47.62 2.04 46.00 44.00 52.00 12
Air.Pressurer 2571 142.83 1.21 142.60 140.80 148.20 0
Alch.Rel 2562 6.90 0.51 6.56 5.28 8.62 9
Carb.Rel 2561 5.44 0.13 5.40 4.96 6.06 10
Balling.Lvl 2570 2.05 0.87 1.48 0.00 3.66 1

Duplicated Rows Check and NAs check

We check for any duplicated rows in the train data through the duplicated function. This is to inform us if any duplicated data needs to be handled with appropriately. From the output, we can see that there is no duplicated data. So, next we can begin to check for the percentage of missing data in our dataset, so that we can begin uncovering any patterns related to the missingness.

sum(duplicated(train_data))
## [1] 0

Checking for Data Mechanisms

We conduct a quick NHANES check to see which type of data mechanism it could possibly be. To do this, we can check the missing data mechanism through summary statistics. We do so through looking at the overall missingness per variable, which allows us to identify why the data is missing. From the summary statistic, we are able to see that the missing data is likely MAR (Missing At Random) or MCAR (Missing Completely At Random), but not MNAR (Missing Not At Random). This is due to MFR containing the highest number of NAs (8.2%), which tells us that missingness is not uniform across variables and that most other variables have < 2% missingness.

Missingness per Variable
Variable Percent_NA
MFR 8.2
Filler.Speed 2.2
Fill.Ounces 1.5
PC.Volume 1.5
PSC.CO2 1.5
PSC 1.3
Carb.Pressure1 1.2
Hyd.Pressure4 1.2
Carb.Pressure 1.1
Carb.Temp 1.0
PSC.Fill 0.9
Fill.Pressure 0.9
Filler.Level 0.8
Hyd.Pressure2 0.6
Hyd.Pressure3 0.6
Temperature 0.5
Oxygen.Filler 0.5
Pressure.Setpoint 0.5
Carb.Volume 0.4
Hyd.Pressure1 0.4
Alch.Rel 0.4
Carb.Rel 0.4
Usage.cont 0.2
PH 0.2
Mnf.Flow 0.1
Carb.Flow 0.1
Bowl.Setpoint 0.1
Brand.Code 0.0
Density 0.0
Balling 0.0
Pressure.Vacuum 0.0
Air.Pressurer 0.0
Balling.Lvl 0.0

To validate that the data mechanism is MAR, we can check if MFR against the mean temperature. From the summary, we can see that when MFR is not missing, the mean temperature is lower than when the MFR is missing. This tells us that there’s around a difference of 1.88 degrees Fahrenheit higher mean Temperature when MFR is missing, so MFR does correlates with higher temperature.

Mean Temperature by MFR Missing Status
MFR_missing mean_Temp n
FALSE 65.82 2359
TRUE 67.70 212

Additionally, to validate, we can run a Welch Two Sample t-test to check if it is a significant difference. By rejecting the Null Hypothesis that the null values are MCAR, we accept the alternative that the data mechanism is MAR. Telling us that the missingness has a pattern.

t.test(Temperature ~ MFR_missing, data = train_data)
## 
##  Welch Two Sample t-test
## 
## data:  Temperature by MFR_missing
## t = -9.7049, df = 202.45, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -2.254130 -1.492855
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            65.82247            67.69596

The vis_miss function from the naniar library visualizes any patterns or relationships that may need to be preserved. In the plots below, it confirms our summary statistics as we observed that around 1% of the data contains NAs or missing data. 8% of the missing data being from MFR. Other predictors have missingness < 2.5%.

From our analysis of the missingness in the dataset, we can conclude that the data mechanism is MAR. Since it is MAR, it is suggested that handle the missing data through multiple imputation such as missForest, which is able to iteratively predicting missing values using random forest models. MissForest works well in our case due to working with various kinds of continuous and categorical variables, creating a single completed data set by iteratively predicting missing values.

Analyzing the Distribution of Predictors

In the histograms below, we observe that predictors have varied distributions as some are normal, some are bi-modal, some are skewed. Additionally, predictors have different units and scales (e.g., Carb.Flow in the 1000s, PSC in decimals of 0.1). For distance-based models, this scale difference would incorrectly weight predictors with larger numbers as more important. So, we will apply standardization to all numeric predictors to ensure equal contribution.

Additionally, we check for outliers in the predictors through the box plot below.

It is notable that Fill.Ounces, PC.Volume, PSC, Carb.Pressure1, Fill.Pressure, Temperature, MFR, Oxygen.Filler, Air.Pressurer, Filler.Speed contain a large number of outliers mostly skewed, so we will check if these variables have real impact on the target variable PH.

Upon closer inspection, the predictor variables from the scatter plot below reveals many things in relation to the PH, but first, let’s take a look at the predictors Temperature, Air.Pressurer and Filler.Speed.

In the individual scatter plot of Temperature, Air.Pressurer and Filler.Speed, they show that they act more like a factor/category than numeric. For these predictors, it could possibly be controlled by a discrete machine or factory setting. For Filler.Speed, the outliers scattered can be presumed to be from the machine alternating between 1000 and 4000. Since these predictors form vertical bands, it is suggested that they don’t have high correlation to the PH. However, we can create a correlation table between these predictors and our target variable to verify.

By creating a correlation table, we can see that Air.Pressure and Filler.Speed contain p values > 0.05, signifying no meaningful linear relationship with PH. Temperature has a weak negative correlation to PH with (r = -0.158). While the correlation table shows no meaningful linear relationship, it is not recommended to remove these predictors based on p-values alone, as it disregards non-linear patterns. Additionally, predictors can have meaningful interactions with other predictors to predict PH well. This correlation table is just to test the linear relationships the predictors have on PH and to inform the outlier detection process.

##           Variable Correlation      P_Value Significant
## cor    Temperature      -0.158 8.592733e-16         Yes
## cor1 Air.Pressurer      -0.014 4.891116e-01          No
## cor2  Filler.Speed      -0.025 2.183784e-01          No

Afterwards, we can begin to do more analysis of the predictors left: Temperature, Fill.Ounces, PC.Volume, PSC, Carb.Pressure1, Oxygen.Filler, MFR and Fill.Pressure. We will use the IQR rule for detecting outliers outside the IQR range (Q1 - 1.5 * IQR AND Q3 + 1.5 * IQR).

From the graph, we see:

  • Fill.Ounces, PC.Volume, PSC, Carb.Pressure1 and Fill.Pressure have outliers that aren’t extremely far from the IQR.
  • Oxygen.Filler, MFR and Temperature suggests extreme outliers.

From our previous observations, we note that Temperature acts more like a factor/category than numeric predictor.

The above scatterplot has overplotting issues due to the data points clustering in a cloud-like formation, so we decided on randomly sampling 500 rows of the dataset to reduce this issue. From the randomly sampled scatter plot of the predictors, we can see that:

  • MFR and Fill.Pressure exhibits factor/category likeness, rather than numeric likeness. Distribution across the PH regression line suggestions no to weak correlation.
  • Fill.Ounces, PC.Volume, PSC, Carb.Pressure1 and Oxygen.Filler points on the scatterplot exhibit weak/no relationship to the regression line of PH.

We can additionally check MFR and Fill.Pressure for their correlation to PH as they exhibit vertical clustering. From the table, we can see that MFR has a p values > 0.05, signifying no meaningful linear relationship with PH. While Fill.Pressure’s p-value is < 0.05 and signifies linear correlation to PH.

##           Variable Correlation      P_Value Significant
## cor            MFR      -0.009 6.721382e-01          No
## cor1 Fill.Pressure      -0.213 1.199531e-27         Yes

From analyzing the predictors with seemingly large number of outliers or extreme outliers, it should be noted that none of the tested predictors have a notable significance to PH. Additionally, predictors such as MFR, Fill.Pressure, Temperature, Air.Pressure and Filler.Speed exhibits factor/category likeness, rather than numeric likeness and their range outliers most likely stem from the machine or factory setting.

Analyzing Brand.Code

Afterwards, we can check if the categorical variable Brand.Code has any significant predictive effects on the target variable PH. This is to inform us whether we should keep our categorical variable or not. Through the boxplot, we can see that each brand (including NAs) have varying median lines, indicating that they are different types of beverages that potentially come from various manufacturing processes. Interestingly, the NAs Brand.Code distribution ties closely to the distribution of all over brand.codes.

Let’s confirm that each brand has a statistically different PH distribution from every other brand through a mann-Whitney U test. As shown below, we can see that each pairwise comparison of Brand.Code has a statistically significant pH value (all p-values < 0.05 after Bonferroni correction), indicting that the brands represent different beverage types or formulations, not only a minor change/variation.

#mann-Whitney U test pairwise comparisons, labeled brand.code nas as missing
ph_by_brand <- train_data %>%
  mutate(Brand.Code = forcats::fct_na_value_to_level(Brand.Code, "Missing"))

#https://www.statology.org/bonferroni-correction/
#pairwise Mann-Whitney U test with bonferroni
#in Bonferroni correction each raw p-value is multiplied by ( m ) (compare each p-value to (\alpha/m)).
#ensures strong control of the family-wise error rate.
pairwise_results <- pairwise.wilcox.test(
  ph_by_brand$PH, 
  ph_by_brand$Brand.Code,
  p.adjust.method = "bonferroni",
  paired = FALSE
)

print(pairwise_results)
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  ph_by_brand$PH and ph_by_brand$Brand.Code 
## 
##           A       B       C      
## A 1.00000 -       -       -      
## B 0.00016 2.0e-07 -       -      
## C 0.00165 3.6e-09 < 2e-16 -      
## D 4.2e-10 < 2e-16 0.00058 < 2e-16
## 
## P value adjustment method: bonferroni

From an ANOVA test of the Brand.Code to see the effects of Brand.Code on PH, we can see that p < 2e-16, signifying it as highly significant. Additionally, Brand.Code explains 11.7% of the variance in PH. Because of this, we decided to keep Brand.Code.

aov_result <- aov(PH ~ Brand.Code, data = train_data)
summary(aov_result)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Brand.Code     4   8.90  2.2258   84.52 <2e-16 ***
## Residuals   2562  67.47  0.0263                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4 observations deleted due to missingness
ss <- summary(aov_result)[[1]]$"Sum Sq"
r_sq <- ss[1] / sum(ss)
#variance by brand.code
round(r_sq * 100, 2)
## [1] 11.66

Analyzing Variable Correlations

Before we begin to check the correlation of predictors, it is important to exclude PH from the training data so that it doesn’t affect any of the predictor correlations. After doing so, we can start analyzing the relationships of the variables. Through the correlation plot below, we can see that variables such as Hy.Pressure3, Filler.Level, Filler.Speed, Density, Balling, Alch.Rel and Carb.Rel have correlations > 85%. Before we remove the variables, however, we can use the function findCorrelation() to match our correlation plot.

From the findCorrelation function, we can see that they match the analysis of the correlation plot, with the exception of Carb.Rel. This is likely due to carb.rel having an average correlation that is less than 85% with the other predictors. As such, it will not be a candidate for removal. However, the rest of the predictors: “Balling”, “Hyd.Pressure3”, “Alch.Rel”, “Balling.Lvl”, “Density”, “Filler.Level” and “Filler.Speed” is recommended.

## [1] "Balling"       "Hyd.Pressure3" "Balling.Lvl"   "Alch.Rel"     
## [5] "Density"       "Filler.Level"  "Filler.Speed"

Reccomendations from EDA

  • Predictors: Remove Balling, Hyd.Pressure3, Alch.Rel, Balling.Lvl, Density,Filler.Level, and Filler.Speed due to high correlation with other predictors.
  • Consider robust methods or tree-based models over standard linear regression when dealing with the predictor outliers. Several predictors (MFR, Fill.Pressure, Temperature, Air.Pressure, Filler.Speed) show non-normal distributions and non/weak linear correlations with PH. In addition to exhibiting patterns as that of an ordinal categorical variable. This approach could be less sensitive to outliers and can capture non-linear patterns without requiring transformations.
  • After removal of suggested predictors, it is recommended to use multiple imputation (such as missForest) on NAs. As it is flexible to our sample size of predictors. Since we also will be keeping our categorical variable, Brand.Code, missForest allows for working with various kinds of continuous and categorical variables, creating a single completed data set by iteratively predicting missing values.
  • Consider standardization and normalization of predictors in the model building process due to predictors having different units and scale.

Note: Brand.Code most likely represents different beverage types or formulations, not only a minor change/variation.

Data Preparation

Train and Validation Data Split

Before we impute missing values, handle outliers, or transform features, we must split our dataset. So, we partitioned into training and validation subsets by randomly selecting 80% for training and leaving the remaining 20% set aside for testing. If we calculate global metrics like means for imputation or correlation thresholds on the entire dataset, information from the validation set will leak into the training process. By splitting first, we ensure our preprocessing rules are learned only from the training data and then blindly applied to the validation data, giving us a true measure of how our model will perform in the real world.

url_train <- "https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv"
raw_data <- read.csv(url_train, stringsAsFactors = FALSE)

# Drop rows where target is missing (standard)
raw_data_clean <- raw_data %>% filter(!is.na(PH))

# Split
set.seed(42)
train_index <- createDataPartition(raw_data_clean$PH, p = 0.8, list = FALSE)

train_set <- raw_data_clean[train_index, ]
val_set   <- raw_data_clean[-train_index, ]

#checking size of train and validation set
#nrow(train_set)
#nrow(val_set)

Removing Highly Correlated Predictors

As we noted in the EDA feedback, several predictors in this dataset (like Balling, Hyd.Pressure3, Density, etc.) are highly correlated with each other (correlation > 0.85). This is called multicollinearity.

If two variables are perfectly moving together, they are essentially telling the model the exact same piece of information twice. Leaving those variables in confuses the algorithm and it doesn’t know which variable to give credit to, slows down computation, and makes the model prone to overfitting. We must identify these redundant variables in our training set and remove them from both sets.

Now, we will calculate the correlation matrix on the numeric predictors. We use pairwise.complete.obs to ensure that the current missing values (NAs) don’t crash our math. Then, we use the findCorrelation function to automatically flag the variables that cross our 0.85 threshold.

# Isolate only the numeric predictor variables from the training set
# (We exclude PH because it is the target, and Brand.Code because it is categorical)
numeric_train <- train_set %>% select(where(is.numeric), -PH)

# Calculate the correlation matrix
# We use pairwise.complete.obs to safely handle the NAs currently in the data
cor_matrix <- cor(numeric_train, use = "pairwise.complete.obs")

# Find predictors with an absolute correlation greater than 0.85
high_cor_indices <- findCorrelation(cor_matrix, cutoff = 0.85, exact = TRUE)
high_cor_vars <- colnames(numeric_train)[high_cor_indices]

# We will also be removing Alch.Rel as per our EDA analysis
high_cor_vars <- c(high_cor_vars, "Alch.Rel")

# Vars to remove with high correlation
print(high_cor_vars)
## [1] "Balling"       "Hyd.Pressure3" "Balling.Lvl"   "Density"      
## [5] "Filler.Level"  "Filler.Speed"  "Alch.Rel"

The split training and validation sets have now 26 predictors, instead of 31.

train_set_reduced <- train_set %>% select(-all_of(high_cor_vars))
val_set_reduced <- val_set %>% select(-all_of(high_cor_vars))

# New Training Set Dimensions
dim(train_set_reduced)
## [1] 2055   26
# New Validation Set Dimensions
dim(val_set_reduced)
## [1] 512  26

Random Forest Imputation

missForest is running a full machine learning prediction process for every single column that has an NA. It treats the missing data as the target to be predicted, and uses all the non-missing data as the clues.

set.seed(42)

train_imputation <- missForest(train_predictors, 
                               maxiter = 5,    
                               ntree = 100,    
                               verbose = TRUE) 
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.1677212 0.1346252 
##     difference(s): 0.0006085918 0.02481752 
##     time: 1.69 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.1637871 0.1371749 
##     difference(s): 6.235666e-06 0.00973236 
##     time: 2.09 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.1611763 0.1336053 
##     difference(s): 5.843416e-06 0.009245742 
##     time: 2.31 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.1592056 0.1341152 
##     difference(s): 5.449064e-06 0.01021898 
##     time: 2.05 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.159595 0.1315655 
##     difference(s): 3.784334e-06 0.01021898 
##     time: 2.05 seconds
# Extract the clean data and reattach our target variable
train_set_clean <- train_imputation$ximp
train_set_clean$PH <- train_set_reduced$PH

Imputing the Validation Set

The future machine learning model needs clean, complete data to make its predictions. If we leave NAs in the validation set, the model will crash when we try to test it later. We will use the exact same missForest algorithm we did on the training set to fill in the blanks here.

##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.2261653 0.1769547 
##     difference(s): 0.0001755827 0.0234375 
##     time: 0.54 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.2213659 0.1831276 
##     difference(s): 6.491444e-06 0.005859375 
##     time: 0.51 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.2282613 0.1666667 
##     difference(s): 3.899296e-06 0.005859375 
##     time: 0.53 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.2320909 0.1748971 
##     difference(s): 2.175098e-06 0.00390625 
##     time: 0.52 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.2269774 0.1687243 
##     difference(s): 3.023564e-06 0.0078125 
##     time: 0.55 seconds
## Total NAs remaining in the validation set: 0

Centering and Scaling

Our EDA shows predictors have different units and scales. If we use a distance-based machine learning model (like K-Nearest Neighbors, Neural Networks, or Support Vector Machines), the algorithm will look at those numbers and assume that Pressure is the most important variable simply because the numbers are bigger.

To fix this, we standardizes the data:

  • Center: Subtracts the mean (so every variable’s average becomes 0).
  • Scale: Divides by the standard deviation (so every variable’s spread is 1).

We must calculate the average and standard deviation ONLY on the Training Set, and then apply that exact same math to the Validation Set. If we calculate the validation set’s own average, we risk leaking future data into our model.

# Identify ONLY the numeric predictor columns 
# (We exclude our target 'PH', and 'where(is.numeric)' automatically excludes 'Brand.Code')
numeric_cols <- train_set_clean %>% 
  select(where(is.numeric), -PH) %>% 
  names()

# Build the Scaling Math based only on the Training Set
scale_model <- preProcess(train_set_clean[, numeric_cols], 
                          method = c("center", "scale"))

# Apply the scaling math to the Training Set
train_set_final <- predict(scale_model, train_set_clean)

# Apply the exact same scaling math to the Validation Set
val_set_final <- predict(scale_model, val_set_clean)

#Check the summary of Fill.Ounces to see the new standardized scale
summary(train_set_final$Fill.Ounces)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.98555 -0.63492 -0.01155  0.00000  0.61183  4.04038

Model Building and Evaluation

Before training the other models, setting up a baseline Linear Regression provides a critical benchmark. If a highly complex algorithm only marginally improves the RMSE (Root Mean Squared Error) over a basic linear model, the computational cost might not be justified.

To ensure all models are evaluated fairly without touching the isolated validation set, K-Fold Cross-Validation (5 folds) is implemented during the training phase.

# Establish the Cross-Validation Framework
set.seed(42)
cv_control <- trainControl(method = "cv", 
                           number = 5, 
                           savePredictions = "final")

Baseline Linear Regression Model

# Training Baseline Linear Regression Model
set.seed(42)
lm_model <- train(PH ~ ., 
                  data = train_set_final, 
                  method = "lm", 
                  trControl = cv_control)
Linear Regression Performance (5-Fold CV)
Performance Metrics
RMSE Rsquared MAE
0.1377 0.3625 0.1077

In our baseline model, we can see its performance:

  • RMSE: 0.1377
  • Rsquared: 0.36
  • MAE: 0.1077

Though our linear regression model was fast to train, the baseline model only explains 36% of the variance in PH. The typical prediction error is ±0.1377 PH units. For beverage manufacturing quality control, these error margins are unacceptable as they can disrupt processes, damage equipment, and compromise product quality or safety.

As such, we will only be using this as comparison to the other models we will be evaluating.

Linear regression diagnostic plot analysis

Our diagnostic plots reveal that linear regression violates multiple statistical assumptions, confirming it is inappropriate for this data:

  • Residual vs Fitted: The slightly curved red line indicates a weak non-linear relationship that a straight line cannot capture. This explains the model’s R² (0.36).
  • Q-Q Residuals: Points bend away from diagonal line at both ends, indicating non-normal residuals. As such, the model consistently over/under predicts at extreme PH values, which negatively affects quality control.
  • Scale-Location: Red line slopes downwards slightly, indicating heteroscedasticity.
  • Residuals vs Leverage: Points 1093, 494, 2083 have high leverage. These influential points disproportionately affect the regression line, making the model unstable.

Using linear regression would lead to unreliable PH predictions, equipment damage, and safety and quality control issues.

Random Forest Model

set.seed(42)

rf_model <- train(PH ~ ., 
                        data = train_set_final, 
                        method = "rf", 
                        trControl = cv_control,
                        tuneLength = 3,    # Tests 3 different 'mtry' tuning parameters
                        importance = TRUE)

print(rf_model)
## Random Forest 
## 
## 2055 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1643, 1644, 1645, 1644, 1644 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1185277  0.5693689  0.09036701
##   14    0.1072305  0.6233415  0.07870450
##   27    0.1076169  0.6138156  0.07774770
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
Random Forest Performance (5-Fold CV)
RMSE Rsquared MAE
0.1072 0.6233 0.0787

In our random forest model, we can see its performance:

  • RMSE: 0.1072
  • Rsquared: 0.62
  • MAE: 0.0777

Our random forest model explains ~62% of variance in PH. The typical prediction error is ±0.1072 PH units.

Random Forest Analysis

From our diagnostics of the Random Forest model below, we can see:

  • Predicted v. Actual PH: The points cluster tightly, with no systematic over/under prediction. This means the model doesn’t consistently favor high or low PH values.
  • Residuals v. Predicted PH: Errors are random and small at ±0.1 PH, with no discernible pattern.
# Get predictions 
rf_predictions <- predict(rf_model, newdata = train_set_final)

# Create a temporary dataframe for plotting only
plot_df <- data.frame(
  PH = train_set_final$PH,
  predicted = rf_predictions
)

# Predicted vs Actual PH
p1 <- ggplot(plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "Predicted v. Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal()

# Residuals vs Predicted PH
plot_df$residuals <- plot_df$PH - plot_df$predicted

p2 <- ggplot(plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Residuals v. Predicted PH",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal()

p1 + p2

K-Nearest Neighbor Model

# "Training KNN Model
set.seed(42)
knn_model <- train(PH ~ ., 
                   data = train_set_final, 
                   method = "knn",
                   tuneGrid = expand.grid(k = seq(3, 21, by = 2)),
                   trControl = cv_control,
                   metric = "RMSE")
KNN Performance (5-Fold CV)
Best k
Performance Metrics
k RMSE Rsquared MAE
11 0.1292 0.4503 0.0973

In our K-Nearest Neighbor model, we can see its performance:

  • RMSE: 0.1292
  • Rsquared: 0.45
  • MAE: 0.0947

Our K-Nearest Neighbor model explains ~45% of variance in PH. The typical prediction error is ±0.1292 PH units. We will make comparisons to the baseline model, after all models have been evaluated.

KNN is an improvement over the linear regression model.

KNN Analysis

From our diagnostics of the K-Nearest Neighbor Model below, we can see:

  • Predicted v. Actual PH: Points cluster loosely, with systematic over prediction at higher PH values.
  • Residuals v. Predicted PH: Errors are random but some are as large at ±0.25 PH.

Cubist Model

X_val_fixed <- val_set_final %>%
  select(-PH) %>%
  select(all_of(names(train_set_final %>% select(-PH))))

# Add back PH
val_set_fixed <- X_val_fixed
val_set_fixed$PH <- val_set_final$PH

cubist_model <- train(PH ~ ., 
                   data = train_set_final, 
                   method = "cubist",
                   trControl = cv_control,
                   tuneLength = 5)
Cubist Performance (5-Fold CV)
RMSE Rsquared MAE
0.1084 0.6072 0.0789

In our Cubist model, we can see its performance:

  • RMSE:0.1084
  • Rsquared: 0.61
  • MAE: 0.0789

Our Cubist model explains ~61% of variance in PH. The typical prediction error is ±0.1084 PH units.

Cubist Analysis

From our diagnostics of the Cubist model below, we can see:

  • Predicted v. Actual PH: The points cluster tightly, with no systematic over/under prediction. This means the model doesn’t consistently favor high or low PH values.
  • Residuals v. Predicted PH: Errors are random and small mostly at ±0.1 PH, with no discernible pattern. While some larger errors are at ±0.2 PH.

Treebag Model

treebag_model <- train(PH ~ ., 
                       data = train_set_final, 
                       method = "treebag",
                       trControl = cv_control,
                       nbagg = 100)

print(treebag_model)
## Bagged CART 
## 
## 2055 samples
##   25 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1643, 1645, 1643, 1645, 1644 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1264163  0.4665561  0.09760613
Tree Bag Performance (5-Fold CV)
RMSE Rsquared MAE
0.1264 0.4666 0.0976

In our Treebag model, we can see its performance:

  • RMSE:0.1264
  • Rsquared: 0.47
  • MAE: 0.0976

Our Treebag model explains ~47% of variance in PH. The typical prediction error is ±0.1264 PH units.

Treebag Analysis

From our diagnostics of the Treebag model below, we can see:

  • Predicted v. Actual PH: The points spread loosely. The model over predicts points with higher PH and under predicts point with lower PH.
  • Residuals v. Predicted PH: Errors are random and small mostly at ±0.10-0.25 PH, with no discernible pattern. While some larger errors are at ±0.50 PH.

XGBoost Model

XGBoost Performance (5-Fold CV)
RMSE Rsquared MAE
0.1193 0.5219 0.0906

In our XGBoost model, we can see its performance:

  • RMSE:0.1193
  • Rsquared: 0.52
  • MAE: 0.0905

Our Treebag model explains ~52% of variance in PH. The typical prediction error is ±0.12 PH units.

XGBoost Analysis

From our diagnostics of the XGBoost model below, we can see:

  • Predicted v. Actual PH: The points moderately cluster around the reference line. The model tends to over predict points with higher PH.
  • Residuals v. Predicted PH: Errors are random and small mostly at ±0.10 PH, with no discernible pattern. While some larger errors are at ±0.25 PH.

Model Performance Comparison Summary

Model Performance Comparison (5-Fold CV)
Performance Metrics
Model RMSE Rsquared MAE
Random Forest 0.1072 0.6233 0.0777
Cubist 0.1084 0.6072 0.0789
XGBoost 0.1193 0.5219 0.0905
Tree Bag 0.1264 0.4666 0.0976
KNN 0.1292 0.4503 0.0947
Linear Regression 0.1377 0.3625 0.1077

From our model comparison summary above, there are differences between models.

  • Linear Regression Model: From our baseline model, we can see that it only explained 36% of PH variance with an RMSE of 0.1377 and MAE of 0.1077, which is unacceptable for manufacturing quality control.
  • KNN Model: showed modest improvement over linear regression. As it explains 45% of the PH variance with an RMSE of 0.1292 and MAE of 0.0947. While KNN captured some non-linear patterns, its performance remained insufficient for production use.
  • Tree Bag: In this model, it delivered moderate performance. Explaining 47% of variance with an RMSE of 0.1264 and MAE of 0.0976, it slightly outperforms KNN model. This ensemble method improved over KNN and linear models, confirming that tree-based approaches are more suitable for this data.
  • XGBoost: This model showed stronger performance despite the lower metrics than expected. As it explains 52% of the variance, and has an RMSE of 0.1193 and MAE of 0.0905, it performs better than our baseline, KNN and Tree Bag models. The boosting approach captured complex patterns, though tuning constraints may have limited its full potential.
  • Cubist: The Cubist Model performed competitively among tree-based models. Being able to explain 61% of the variance with an RMSE of 0.1084 and MAE of 0.0789. As a rule-based model, Cubist offered a good balance of accuracy and interpretability, making it valuable for understanding decision rules.
  • Random Forest: Emerged as the best overall model, explaining 62% of the variance with an RMSE of 0.1072 and MAE of 0.0777.

Selecting the Model

Due to Random Forest containing the highest Rsquared and lowest RSME and MAE, it is recommended for production deployment. It explains 62% of PH variance with typical prediction errors of ±0.11 PH units. While further improvement is possible, this model provides the most reliable predictions for quality control in beverage manufacturing.

Random Forest Top Predictors

With Random Forest selected as our final model, we examine predictor importance to understand which variables drive PH predictions. This reveals the key sensors and process measurements that most influence beverage quality.

From the plot of the top 15 predictors of PH above, it reveals clear rankings of influence on PH:

  • Mnf.Flow: Is the most dominant predictor of PH. Manufacturing flow rate has the strongest influence on PH more than any other process measurement.
  • Brand.CodeC: Brand C has a distinct chemical profile compared to other brands.
  • Usage.cont: Continuous usage measurements moderately influence PH.
  • Pressure.Vacuum: Vacuum pressure conditions affect acidity levels.

The remaining top 15 predictors show modest but meaningful contributions to PH prediction and not primary drivers.

Testing our Validation Set

We now test the Random Forest model on the 20% validation set.

# Final validation predictions
final_predictions <- predict(rf_model, newdata = val_set_final)

# Calculate metrics
final_score <- postResample(pred = final_predictions, obs = val_set_final$PH)

final_results <- data.frame(
  Value = round(c(final_score[1], final_score[2], final_score[3]), 4)
)

final_results %>%
  kable(caption = "Random Forest Validation Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
Random Forest Validation Performance
Value
RMSE 0.0992
Rsquared 0.6804
MAE 0.0744
  1. The model captures 68% of PH variance on unseen data. Linear regression managed only 36%. This is the difference between guessing and knowing.
  2. RMSE = 0.0992: Predictions stay within a tenth of a pH point. When the target is 8.50, the model delivers 8.40–8.60.
  3. Training RMSE is 0.1061 while Validation RMSE is 0.0992: The model actually performs better on new data. This proves the model learned the actual relationships in the manufacturing process, not just memorized the training data.
##   removed variable(s) 21 due to the missingness of all entries

Finalized Model Predictions On Beverage Manufacturing

After predicting the test data with our random forest model, we can create some plots to visualize typical factory predictions.

In our histogram of our predicted PH below, we can anticipate:

  • The vast majority of batches cluster tightly around 8.5 pH, which represents normal production for most brands.
  • A smaller but clear second cluster appears near 8.7 pH. Based on our variable importance analysis (Brand.CodeC ranked second), this spike corresponds to Brand C. The model automatically adjusts pH upward when this brand runs on the line.
hist_plot <- ggplot(submission_file, aes(x = Predicted_PH)) +
  geom_histogram(fill = "#b19cd9", color = "white", bins = 20) +
  theme_minimal() +
  labs(title = "What the Factory will produce today (Predicted PH)",
       x = "Predicted PH Level",
       y = "Number of Batches")

print(hist_plot)

In the graph below, we can determine whether the predictions look normal:

  • The predicted pH distribution (in orange) falls entirely within historical bounds (in gray), confirming the model respects physical process limits.
  • The sharper orange peak reflects Random Forest’s confidence in specific machine set-points, not error.
train_ph <- data.frame(PH_Value = train_set_final$PH, Data_Type = "1. Historical (Training)")
test_ph <- data.frame(PH_Value = submission_file$Predicted_PH, Data_Type = "2. Predicted (Test)")
combined_ph <- rbind(train_ph, test_ph)

overlap_plot <- ggplot(combined_ph, aes(x = PH_Value, fill = Data_Type)) +
  geom_density(alpha = 0.6) + 
  theme_minimal() +
  scale_fill_manual(values = c("1. Historical (Training)" = "darkgray", 
                               "2. Predicted (Test)" = "darkorange")) +
  labs(title = "Prediction v. Reality",
       x = "PH Value",
       y = "Density",
       fill = "Data Source")

print(overlap_plot)

Conclusion

Random Forest was selected as the final model, achieving the highest R² and lowest RMSE among all candidates. Its top predictors Mnf.Flow, Brand.CodeC, Usage.cont, and Pressure.Vacuum represent the key process variables governing beverage pH.

Future Directions

Now that we have our final predictions, the ABC Beverage data science team has identified several next steps:

  • Root cause analysis on high-risk batches. By identifying which predictor combinations lead to out-of-spec pH.
  • Developing an interactive dashboard enabling operators to adjust key parameters (e.g., Mnf.Flow, Temperature) and instantly see predicted pH, transforming the model into a practical decision support tool.
  • Deploying the model as a real-time monitoring system by integrating it into the factory’s sensor stream. This will flag pH deviations before batches complete, enabling immediate corrective action.

Appendix

The appendix includes all the code included to build this technical report.

# Load libraries
library(tidyverse)
library(mice)      # For advanced imputation
library(caret)     # For data splitting and pre-processing
#library(rcompanion)
library(ggplot2)
library(UpSetR) # Upset plot
library(corrplot) # corelation plot
library(tidyr)
library(patchwork)
library(kableExtra)
library(naniar)
library(psych)
library(dplyr)
library(rstatix)
library(mice)
library(dplyr)
library(visdat)
library(missForest)
library(xgboost)
library(randomForest)
library(fastDummies)
library(patchwork)

# Load the datasets
train_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv", stringsAsFactors = FALSE)

test_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentEvaluation.csv", stringsAsFactors = FALSE)

cat_summary <- train_data %>%
  count(Brand.Code) %>%
  mutate(
    Percentage = round(n / sum(n) * 100, 2)
  ) %>%
  rename(
    "Brand Code" = Brand.Code,
    "Frequency" = n,
    "Percent" = Percentage
  )

cat_summary %>%
  kable(caption = "Descriptive Stats Brand.Code") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
numeric_vars <- train_data %>% select(where(is.numeric))

num_summary <- describe(numeric_vars) %>%
  rownames_to_column("Variable") %>%
  mutate(NA_Count = sapply(numeric_vars, function(x) sum(is.na(x)))) %>%
  select(Variable, n,  mean, sd, median, min, max, NA_Count) %>%
  mutate(across(where(is.numeric), ~ round(., 2))) %>%
  rename(
    "Count" = n,
    "Mean" = mean,
    "Std Dev" = sd,
    "Median" = median,
    "Min" = min,
    "Max" = max,
    "NA's" = NA_Count
  )

num_summary %>%
  kable(caption = "Descriptive Stats Numeric Predictors") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

#summary(train_data)

sum(duplicated(train_data))

na_summary <- train_data %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  tidyr::pivot_longer(everything(), names_to = "Variable", values_to = "Percent_NA") %>%
  mutate(Percent_NA = round(Percent_NA, 1)) %>%
  arrange(desc(Percent_NA))

na_summary %>%
  kable(caption = "Missingness per Variable") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

train_data <- train_data %>%
  mutate(MFR_missing = is.na(MFR))

mar_check <- train_data %>%
  group_by(MFR_missing) %>%
  summarise(mean_Temp = round(mean(Temperature, na.rm = TRUE), 2),
            n = n())

mar_check %>%
  kable(caption = "Mean Temperature by MFR Missing Status") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

t.test(Temperature ~ MFR_missing, data = train_data)

vis_miss(train_data) + 
  coord_flip() 

#histogram
train_data %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "#b19cd9", alpha = 0.7) +
  facet_wrap(~name, scales = "free", ncol = 4) +
  theme_minimal()

all_numeric <- train_data %>%
  select(where(is.numeric), -PH) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

var_list <- unique(all_numeric$variable)
plots <- list()

for(i in seq_along(var_list)) {
  plots[[i]] <- ggplot(all_numeric %>% filter(variable == var_list[i]), 
                       aes(x = variable, y = value)) +
    geom_boxplot(fill = "#b19cd9", color = "#483d8b", alpha = 0.7,
                 outlier.color = "darkred", outlier.size = 0.8) +
    labs(title = var_list[i], x = "", y = "") +
    theme_minimal(base_size = 8) +
    theme(plot.title = element_text(size = 9, hjust = 0.5),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank())
}

wrap_plots(plots, ncol = 5)

all_numeric <- train_data %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

#predictors with high num of outliers
selected_vars <- c("Fill.Ounces", "PC.Volume", "PSC", "Carb.Pressure1", 
                   "Fill.Pressure", "Temperature", "MFR", "Oxygen.Filler", "Air.Pressurer", "Filler.Speed")

plots <- list()

for(i in seq_along(selected_vars)) {
  var <- selected_vars[i]
  plot_data <- all_numeric %>% filter(variable == var)
  
  plots[[i]] <- ggplot(plot_data, aes(x = variable, y = value)) +
    geom_boxplot(fill = "#b19cd9", color = "#483d8b", alpha = 0.7,
                 outlier.color = "darkred", outlier.size = 0.8) +
    labs(title = var, x = "", y = "") +
    theme_minimal(base_size = 8) +
    theme(plot.title = element_text(size = 9, hjust = 0.5),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank())
}

wrap_plots(plots, ncol = 3)

for(i in seq_along(selected_vars)) {
  var <- selected_vars[i]
  
  plots[[i]] <- ggplot(train_data, aes(x = .data[[var]], y = PH)) +
    geom_point(alpha = 0.4, size = 1) +
    geom_smooth(method = "lm", se = FALSE, color = "darkred") +
    theme_minimal(base_size = 8) +
    labs(title = var, x = "", y = "PH") +
    theme(plot.title = element_text(size = 9, hjust = 0.5))
}

wrap_plots(plots, ncol = 3)

plot(train_data$Temperature, train_data$PH,
     main = "Temperature vs PH", 
     xlab = "Temperature", ylab = "PH")
plot(train_data$Air.Pressurer, train_data$PH, 
     main = "Air.Pressurer vs PH", 
     xlab = "Air.Pressurer", ylab = "PH")
plot(train_data$Filler.Speed, train_data$PH, 
     main = "Filler.Speed vs PH", 
     xlab = "Filler.Speed", ylab = "PH")

vars <- c("Temperature", "Air.Pressurer", "Filler.Speed")

cor_table <- data.frame()

for(var in vars) {
  test <- cor.test(train_data[[var]], train_data$PH, use = "complete.obs")
  cor_table <- rbind(cor_table, data.frame(
    Variable = var,
    Correlation = round(test$estimate, 3),
    P_Value = test$p.value,
    Significant = ifelse(test$p.value < 0.05, "Yes", "No")
  ))
}

cor_table$P_Value <- format(cor_table$P_Value, scientific = TRUE)
print(cor_table)

vars <- c("Fill.Ounces", "PC.Volume", "PSC", "Carb.Pressure1", "Oxygen.Filler", "MFR", "Fill.Pressure", "Temperature")

plots <- list()

for(var in vars) {
  #iqr
  Q1 <- quantile(train_data[[var]], 0.25, na.rm = TRUE)
  Q3 <- quantile(train_data[[var]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR
  upper <- Q3 + 1.5 * IQR
  
  #flagging the outliers
  plot_data <- train_data %>%
    mutate(is_outlier = train_data[[var]] < lower | train_data[[var]] > upper,
           outlier_color = ifelse(is_outlier, "Outlier", "Normal"))
  
  p <- ggplot(plot_data, aes(x = .data[[var]], y = PH, color = outlier_color)) +
    geom_point(alpha = 0.6, size = 1) +
    geom_smooth(method = "lm", color = "#000000", se = FALSE, size = 0.8) + 
    scale_color_manual(values = c("Normal" = "#b19cd9", "Outlier" = "darkred")) +
    theme_minimal(base_size = 8) +
    theme(legend.position = "none",
          plot.title = element_text(size = 9, hjust = 0.5)) +
    labs(title = var, x = "", y = "PH")
  
  plots[[var]] <- p
}

wrap_plots(plots, ncol = 4)

set.seed(5555)
sample_data <- train_data %>% sample_n(500)

plots <- list()

for(var in vars) {
  Q1 <- quantile(train_data[[var]], 0.25, na.rm = TRUE)
  Q3 <- quantile(train_data[[var]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR
  upper <- Q3 + 1.5 * IQR
  
  plot_data <- sample_data %>%
    mutate(is_outlier = sample_data[[var]] < lower | sample_data[[var]] > upper,
           outlier_color = ifelse(is_outlier, "Outlier", "Normal"))
  
  p <- ggplot(plot_data, aes(x = .data[[var]], y = PH, color = outlier_color)) +
    geom_point(alpha = 0.6, size = 1) +
    geom_smooth(method = "lm", color = "darkred", se = FALSE, size = 1) +
    scale_color_manual(values = c("Normal" = "#b19cd9", "Outlier" = "darkred")) +
    theme_minimal(base_size = 8) +
    theme(legend.position = "none",
          plot.title = element_text(size = 9, hjust = 0.5)) +
    labs(title = var, x = "", y = "PH")
  
  plots[[var]] <- p
}

wrap_plots(plots, ncol = 4)

vars <- c("MFR", "Fill.Pressure")

cor_table <- data.frame()

for(var in vars) {
  test <- cor.test(train_data[[var]], train_data$PH, use = "complete.obs")
  cor_table <- rbind(cor_table, data.frame(
    Variable = var,
    Correlation = round(test$estimate, 3),
    P_Value = test$p.value,
    Significant = ifelse(test$p.value < 0.05, "Yes", "No")
  ))
}

cor_table$P_Value <- format(cor_table$P_Value, scientific = TRUE)
print(cor_table)

boxplot(PH ~ Brand.Code, data = train_data, 
        las = 2, xlab = "Brand Code", ylab = "PH", 
        main = "PH Distribution by Brand Code")

#mann-Whitney U test pairwise comparisons, labeled brand.code nas as missing
ph_by_brand <- train_data %>%
  mutate(Brand.Code = forcats::fct_na_value_to_level(Brand.Code, "Missing"))

#https://www.statology.org/bonferroni-correction/
#pairwise Mann-Whitney U test with bonferroni
#in Bonferroni correction each raw p-value is multiplied by ( m ) (compare each p-value to (\alpha/m)).
#ensures strong control of the family-wise error rate.
pairwise_results <- pairwise.wilcox.test(
  ph_by_brand$PH, 
  ph_by_brand$Brand.Code,
  p.adjust.method = "bonferroni",
  paired = FALSE
)

print(pairwise_results)

aov_result <- aov(PH ~ Brand.Code, data = train_data)
summary(aov_result)

ss <- summary(aov_result)[[1]]$"Sum Sq"
r_sq <- ss[1] / sum(ss)
#variance by brand.code
round(r_sq * 100, 2)

#excludingPH from numeric data
numeric_data <- train_data %>% 
  select(where(is.numeric), -PH)

cor_matrix <- cor(numeric_data, use = "complete.obs")

cor_matrix_filtered <- cor_matrix
cor_matrix_filtered[abs(cor_matrix_filtered) < 0.85] <- NA

corrplot(cor_matrix_filtered, 
         method = "color", 
         type = "upper", 
         tl.cex = 0.7,
         na.label = " ",
         col = colorRampPalette(c("darkred", "white", "#483d8b"))(200),
         title = "Correlations |r| > 0.85",
         mar = c(0,0,2,0))

high_corr <- findCorrelation(cor_matrix, cutoff = 0.85, names = TRUE, exact = TRUE)
high_corr 

high_cor_pairs <- which(abs(cor_matrix) > 0.85 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_cor_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[high_cor_pairs[,1]],
  Var2 = colnames(cor_matrix)[high_cor_pairs[,2]],
  Corr = cor_matrix[high_cor_pairs]
) %>% filter(Var1 != Var2) %>% arrange(desc(abs(Corr)))

#(high_cor_pairs)

url_train <- "https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv"
raw_data <- read.csv(url_train, stringsAsFactors = FALSE)

# Drop rows where target is missing (standard)
raw_data_clean <- raw_data %>% filter(!is.na(PH))

# Split
set.seed(42)
train_index <- createDataPartition(raw_data_clean$PH, p = 0.8, list = FALSE)

train_set <- raw_data_clean[train_index, ]
val_set   <- raw_data_clean[-train_index, ]

#checking size of train and validation set
#nrow(train_set)
#nrow(val_set)

# Isolate only the numeric predictor variables from the training set
# (We exclude PH because it is the target, and Brand.Code because it is categorical)
numeric_train <- train_set %>% select(where(is.numeric), -PH)

# Calculate the correlation matrix
# We use pairwise.complete.obs to safely handle the NAs currently in the data
cor_matrix <- cor(numeric_train, use = "pairwise.complete.obs")

# Find predictors with an absolute correlation greater than 0.85
high_cor_indices <- findCorrelation(cor_matrix, cutoff = 0.85, exact = TRUE)
high_cor_vars <- colnames(numeric_train)[high_cor_indices]

# We will also be removing Alch.Rel as per our EDA analysis
# high_cor_vars <- c(high_cor_vars, "Alch.Rel")

# Vars to remove with high correlation
print(high_cor_vars)

# Lock in the Brand.Code fix for BOTH sets
# (Converting empty strings to true NAs, and making sure they are factors)
train_set_reduced <- train_set_reduced %>%
  mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
  mutate(Brand.Code = as.factor(Brand.Code))

val_set_reduced <- val_set_reduced %>%
  mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
  mutate(Brand.Code = as.factor(Brand.Code))

# Separate the target variable (PH) so the imputer cannot cheat, preventing data leakage
train_predictors <- train_set_reduced %>% select(-PH)
val_predictors <- val_set_reduced %>% select(-PH)

set.seed(42)

train_imputation <- missForest(train_predictors, 
                               maxiter = 5,    
                               ntree = 100,    
                               verbose = TRUE) 

# Extract the clean data and reattach our target variable
train_set_clean <- train_imputation$ximp
train_set_clean$PH <- train_set_reduced$PH

# See total NAs after imputation
total_nas <- sum(is.na(train_set_clean))
#total_nas

set.seed(42)
# Execute missForest on the validation predictors
val_imputation <- missForest(val_predictors, 
                               maxiter = 5,    
                               ntree = 100,    
                               verbose = TRUE) 

# Extract the clean data
val_set_clean <- val_imputation$ximp

# Put the target variable (PH) safely back into the dataset
val_set_clean$PH <- val_set_reduced$PH

# The Final Check: Verify zero NAs remain
total_val_nas <- sum(is.na(val_set_clean))

cat("Total NAs remaining in the validation set:", total_val_nas, "\n")

# Identify ONLY the numeric predictor columns 
# (We exclude our target 'PH', and 'where(is.numeric)' automatically excludes 'Brand.Code')
numeric_cols <- train_set_clean %>% 
  select(where(is.numeric), -PH) %>% 
  names()

# Build the Scaling Math based only on the Training Set
scale_model <- preProcess(train_set_clean[, numeric_cols], 
                          method = c("center", "scale"))

# Apply the scaling math to the Training Set
train_set_final <- predict(scale_model, train_set_clean)

# Apply the exact same scaling math to the Validation Set
val_set_final <- predict(scale_model, val_set_clean)

#Check the summary of Fill.Ounces to see the new standardized scale
summary(train_set_final$Fill.Ounces)

# Establish the Cross-Validation Framework
set.seed(42)
cv_control <- trainControl(method = "cv", 
                           number = 5, 
                           savePredictions = "final")

# Training Baseline Linear Regression Model
set.seed(42)
lm_model <- train(PH ~ ., 
                  data = train_set_final, 
                  method = "lm", 
                  trControl = cv_control)

lm_metrics <- lm_model$results %>%
  select(RMSE, Rsquared, MAE) %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  )

lm_metrics %>%
  kable(caption = "Linear Regression Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
  add_header_above(c("Performance Metrics" = 3))

# Diagnostic plots for Linear Regression
par(mfrow = c(2, 2))  
plot(lm_model$finalModel)
par(mfrow = c(1, 1))

set.seed(42)

rf_model <- train(PH ~ ., 
                        data = train_set_final, 
                        method = "rf", 
                        trControl = cv_control,
                        tuneLength = 3,    # Tests 3 different 'mtry' tuning parameters
                        importance = TRUE)

print(rf_model)

rf_metrics <- rf_model$results %>%
  select(mtry, RMSE, Rsquared, MAE) %>%
  arrange(RMSE)

best_rf <- rf_metrics %>% slice_min(RMSE)

best_rf %>%
  select(RMSE, Rsquared, MAE) %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  ) %>%
  kable(caption = "Random Forest Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")

# Get predictions 
rf_predictions <- predict(rf_model, newdata = train_set_final)

# Create a temporary dataframe for plotting only
plot_df <- data.frame(
  PH = train_set_final$PH,
  predicted = rf_predictions
)

# Predicted vs Actual PH
p1 <- ggplot(plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "Predicted v. Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal()

# Residuals vs Predicted PH
plot_df$residuals <- plot_df$PH - plot_df$predicted

p2 <- ggplot(plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Residuals v. Predicted PH",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal()

p1 + p2

# "Training KNN Model
set.seed(42)
knn_model <- train(PH ~ ., 
                   data = train_set_final, 
                   method = "knn",
                   tuneGrid = expand.grid(k = seq(3, 21, by = 2)),
                   trControl = cv_control,
                   metric = "RMSE")

# KNN metrics
knn_metrics <- knn_model$results %>%
  select(k, RMSE, Rsquared, MAE) %>%
  arrange(RMSE)

best_knn <- knn_metrics %>% slice_min(RMSE)

best_knn %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  ) %>%
  kable(caption = "KNN Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
  add_header_above(c("Best k" = 1, "Performance Metrics" = 3))

# Get KNN predictions (store separately)
knn_predictions <- predict(knn_model, newdata = train_set_final)

# Create temporary dataframe for plotting
knn_plot_df <- data.frame(
  PH = train_set_final$PH,
  predicted = knn_predictions
)

# Predicted vs Actual
p1 <- ggplot(knn_plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "KNN: Predicted vs Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal()

# Residuals vs Predicted
knn_plot_df$residuals <- knn_plot_df$PH - knn_plot_df$predicted

p2 <- ggplot(knn_plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "KNN: Residuals vs Predicted",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal()
options(repr.plot.width = 12, repr.plot.height = 6)
p1 + p2

X_val_fixed <- val_set_final %>%
  select(-PH) %>%
  select(all_of(names(train_set_final %>% select(-PH))))

# Add back PH
val_set_fixed <- X_val_fixed
val_set_fixed$PH <- val_set_final$PH

cubist_model <- train(PH ~ ., 
                   data = train_set_final, 
                   method = "cubist",
                   trControl = cv_control,
                   tuneLength = 5)

cubist_model$results %>%
  arrange(RMSE) %>%
  slice_head(n = 1) %>%
  select(RMSE, Rsquared, MAE) %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  ) %>%
  kable(caption = "Cubist Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")

cubist_pred_train <- predict(cubist_model, newdata = train_set_final)

cubist_plot_df <- data.frame(
  PH = train_set_final$PH,
  predicted = cubist_pred_train
)

# Predicted vs Actual plot
p1 <- ggplot(cubist_plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "Cubist: Predicted vs Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal(base_size = 12)

# Residuals vs Predicted plot
cubist_plot_df$residuals <- cubist_plot_df$PH - cubist_plot_df$predicted

p2 <- ggplot(cubist_plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Cubist: Residuals vs Predicted",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal(base_size = 12)

p1 + p2

treebag_model <- train(PH ~ ., 
                       data = train_set_final, 
                       method = "treebag",
                       trControl = cv_control,
                       nbagg = 100)

print(treebag_model)

# Extract CV metrics
treebag_cv_metrics <- treebag_model$results %>%
  select(RMSE, Rsquared, MAE) %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  )

treebag_cv_metrics %>%
  kable(caption = "Tree Bag Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")

treebag_pred_train <- predict(treebag_model, newdata = train_set_final)

treebag_plot_df <- data.frame(
  PH = train_set_final$PH,
  predicted = treebag_pred_train
)

# Predicted vs Actual plot
p1 <- ggplot(treebag_plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "Tree Bag: Predicted vs Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal(base_size = 12)

# Residuals vs Predicted plot
treebag_plot_df$residuals <- treebag_plot_df$PH - treebag_plot_df$predicted

p2 <- ggplot(treebag_plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "Tree Bag: Residuals vs Predicted",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal(base_size = 12)

p1 + p2

# Ensure train_set_clean is clean
train_set_clean <- train_set_final %>%
  select(-starts_with("rf_"), -starts_with("knn_"), -contains("predicted"), -contains("residuals"))

# One-hot encode Brand.Code
train_xgb <- train_set_clean %>%
  dummy_cols(select_columns = "Brand.Code", remove_first = TRUE) %>%
  select(-Brand.Code)

# Train XGBoost with warnings suppressed
set.seed(42)
xgb_model <- train(PH ~ ., 
                data = train_xgb, 
                method = "xgbTree",
                trControl = cv_control,
                tuneLength = 3,
                verbose = FALSE)

# Extract CV metrics
xgb_model_metrics <- xgb_model$results %>%
  arrange(RMSE) %>%
  slice_head(n = 1) %>%
  select(RMSE, Rsquared, MAE) %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  )

xgb_model_metrics %>%
  kable(caption = "XGBoost Performance (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")

xgb_pred_train <- predict(xgb_model, newdata = train_xgb)

xgb_plot_df <- data.frame(
  PH = train_xgb$PH,
  predicted = xgb_pred_train
)

# Predicted vs Actual plot
p1 <- ggplot(xgb_plot_df, aes(x = PH, y = predicted)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
  coord_fixed() +
  labs(title = "XGBoost: Predicted vs Actual PH",
       x = "Actual PH", y = "Predicted PH") +
  theme_minimal(base_size = 12)

# Residuals vs Predicted plot
xgb_plot_df$residuals <- xgb_plot_df$PH - xgb_plot_df$predicted

p2 <- ggplot(xgb_plot_df, aes(x = predicted, y = residuals)) +
  geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
  labs(title = "XGBoost: Residuals vs Predicted",
       x = "Predicted PH", y = "Residuals") +
  theme_minimal(base_size = 12)

p1 + p2

model_results <- data.frame(
  Model = c("Linear Regression", "Random Forest", "KNN", "Tree Bag", "XGBoost", "Cubist"),
  RMSE = c(
    min(lm_model$results$RMSE),
    min(rf_model$results$RMSE),
    min(knn_model$results$RMSE),
    min(treebag_model$results$RMSE),
    min(xgb_model$results$RMSE),
    min(cubist_model$results$RMSE)
  ),
  Rsquared = c(
    max(lm_model$results$Rsquared),
    max(rf_model$results$Rsquared),
    max(knn_model$results$Rsquared),
    max(treebag_model$results$Rsquared),
    max(xgb_model$results$Rsquared),
    max(cubist_model$results$Rsquared)
  ),
  MAE = c(
    min(lm_model$results$MAE),
    min(rf_model$results$MAE),
    min(knn_model$results$MAE),
    min(treebag_model$results$MAE),
    min(xgb_model$results$MAE),
    min(cubist_model$results$MAE)
  )
)

model_results <- model_results %>%
  mutate(
    RMSE = round(RMSE, 4),
    Rsquared = round(Rsquared, 4),
    MAE = round(MAE, 4)
  ) %>%
  arrange(RMSE)

model_results %>%
  kable(caption = "Model Performance Comparison (5-Fold CV)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
  column_spec(1, bold = TRUE) %>%
  add_header_above(c(" " = 1, "Performance Metrics" = 3))

# Calculate the 'Importance' of each variable
# We scale it to 100 so the top sensor is always '100'
rf_importance <- varImp(rf_model, scale = TRUE)

# We only show the Top 15 so the chart stays clean and easy to read
plot(rf_importance, 
     top = 15, 
     main = "Top 15 Predictors of PH",
     col = "#b19cd9")

# Final validation predictions
final_predictions <- predict(rf_model, newdata = val_set_final)

# Calculate metrics
final_score <- postResample(pred = final_predictions, obs = val_set_final$PH)

final_results <- data.frame(
  Value = round(c(final_score[1], final_score[2], final_score[3]), 4)
)

final_results %>%
  kable(caption = "Random Forest Validation Performance") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")

#Pre-processing test data
#Replace all underscores (_) with dots (.) in the column names
names(test_data) <- gsub("_", ".", names(test_data))

# Fix the Brand.Code hidden NAs (Now it will find it!)
test_data <- test_data %>%
  mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
  mutate(Brand.Code = as.factor(Brand.Code))

# Drop the exact same redundant variables we dropped from training
test_reduced <- test_data %>% select(-all_of(high_cor_vars))

# Impute missing values using missForest 
set.seed(42)
test_imputation <- missForest(test_reduced, maxiter = 5, ntree = 100, verbose = FALSE)
test_clean <- test_imputation$ximp

# Center and Scale using the TRAINING set's math (scale_model)
test_final <- predict(scale_model, test_clean)

# Here we take Random Forest model, feeds it the test_final dataset
# Random Forest to predict PH for the blind Test Set
test_predictions <- predict(rf_model, newdata = test_final)

submission_file <- data.frame(
  Record_ID = 1:length(test_predictions), 
  Predicted_PH = test_predictions
)

#save_path <- "C:/Users/yu_we/Downloads/ABC_Beverage_Final_Predictions.csv"
#write.csv(submission_file, save_path, row.names = FALSE)

hist_plot <- ggplot(submission_file, aes(x = Predicted_PH)) +
  geom_histogram(fill = "#b19cd9", color = "white", bins = 20) +
  theme_minimal() +
  labs(title = "What the Factory will produce today (Predicted PH)",
       x = "Predicted PH Level",
       y = "Number of Batches")

print(hist_plot)

train_ph <- data.frame(PH_Value = train_set_final$PH, Data_Type = "1. Historical (Training)")
test_ph <- data.frame(PH_Value = submission_file$Predicted_PH, Data_Type = "2. Predicted (Test)")
combined_ph <- rbind(train_ph, test_ph)

overlap_plot <- ggplot(combined_ph, aes(x = PH_Value, fill = Data_Type)) +
  geom_density(alpha = 0.6) + 
  theme_minimal() +
  scale_fill_manual(values = c("1. Historical (Training)" = "darkgray", 
                               "2. Predicted (Test)" = "darkorange")) +
  labs(title = "Prediction v. Reality",
       x = "PH Value",
       y = "Density",
       fill = "Data Source")

print(overlap_plot)