As the data science team at ABC Beverage, we are tasked to provide analysis of the company’s manufacturing process, the predictive factors and our predictive model of PH to meet compliance of the new regulations.
From the summary statistic, we can see that there is 1 categorical variable Brand.Code, our target variable PH, and 31 predictor variables. Additionally, we can see that many of these predictor variables and Brand.Code contain NAs. Notably, in Brand.Code, approximately 48% were categorized into B making the most frequent brand code.
| Brand Code | Frequency | Percent |
|---|---|---|
| 120 | 4.67 | |
| A | 293 | 11.40 |
| B | 1239 | 48.19 |
| C | 304 | 11.82 |
| D | 615 | 23.92 |
| Variable | Count | Mean | Std Dev | Median | Min | Max | NA’s |
|---|---|---|---|---|---|---|---|
| Carb.Volume | 2561 | 5.37 | 0.11 | 5.35 | 5.04 | 5.70 | 10 |
| Fill.Ounces | 2533 | 23.97 | 0.09 | 23.97 | 23.63 | 24.32 | 38 |
| PC.Volume | 2532 | 0.28 | 0.06 | 0.27 | 0.08 | 0.48 | 39 |
| Carb.Pressure | 2544 | 68.19 | 3.54 | 68.20 | 57.00 | 79.40 | 27 |
| Carb.Temp | 2545 | 141.09 | 4.04 | 140.80 | 128.60 | 154.00 | 26 |
| PSC | 2538 | 0.08 | 0.05 | 0.08 | 0.00 | 0.27 | 33 |
| PSC.Fill | 2548 | 0.20 | 0.12 | 0.18 | 0.00 | 0.62 | 23 |
| PSC.CO2 | 2532 | 0.06 | 0.04 | 0.04 | 0.00 | 0.24 | 39 |
| Mnf.Flow | 2569 | 24.57 | 119.48 | 65.20 | -100.20 | 229.40 | 2 |
| Carb.Pressure1 | 2539 | 122.59 | 4.74 | 123.20 | 105.60 | 140.20 | 32 |
| Fill.Pressure | 2549 | 47.92 | 3.18 | 46.40 | 34.60 | 60.40 | 22 |
| Hyd.Pressure1 | 2560 | 12.44 | 12.43 | 11.40 | -0.80 | 58.00 | 11 |
| Hyd.Pressure2 | 2556 | 20.96 | 16.39 | 28.60 | 0.00 | 59.40 | 15 |
| Hyd.Pressure3 | 2556 | 20.46 | 15.98 | 27.60 | -1.20 | 50.00 | 15 |
| Hyd.Pressure4 | 2541 | 96.29 | 13.12 | 96.00 | 52.00 | 142.00 | 30 |
| Filler.Level | 2551 | 109.25 | 15.70 | 118.40 | 55.80 | 161.20 | 20 |
| Filler.Speed | 2514 | 3687.20 | 770.82 | 3982.00 | 998.00 | 4030.00 | 57 |
| Temperature | 2557 | 65.97 | 1.38 | 65.60 | 63.60 | 76.20 | 14 |
| Usage.cont | 2566 | 20.99 | 2.98 | 21.79 | 12.08 | 25.90 | 5 |
| Carb.Flow | 2569 | 2468.35 | 1073.70 | 3028.00 | 26.00 | 5104.00 | 2 |
| Density | 2570 | 1.17 | 0.38 | 0.98 | 0.24 | 1.92 | 1 |
| MFR | 2359 | 704.05 | 73.90 | 724.00 | 31.40 | 868.60 | 212 |
| Balling | 2570 | 2.20 | 0.93 | 1.65 | -0.17 | 4.01 | 1 |
| Pressure.Vacuum | 2571 | -5.22 | 0.57 | -5.40 | -6.60 | -3.60 | 0 |
| PH | 2567 | 8.55 | 0.17 | 8.54 | 7.88 | 9.36 | 4 |
| Oxygen.Filler | 2559 | 0.05 | 0.05 | 0.03 | 0.00 | 0.40 | 12 |
| Bowl.Setpoint | 2569 | 109.33 | 15.30 | 120.00 | 70.00 | 140.00 | 2 |
| Pressure.Setpoint | 2559 | 47.62 | 2.04 | 46.00 | 44.00 | 52.00 | 12 |
| Air.Pressurer | 2571 | 142.83 | 1.21 | 142.60 | 140.80 | 148.20 | 0 |
| Alch.Rel | 2562 | 6.90 | 0.51 | 6.56 | 5.28 | 8.62 | 9 |
| Carb.Rel | 2561 | 5.44 | 0.13 | 5.40 | 4.96 | 6.06 | 10 |
| Balling.Lvl | 2570 | 2.05 | 0.87 | 1.48 | 0.00 | 3.66 | 1 |
We check for any duplicated rows in the train data through the duplicated function. This is to inform us if any duplicated data needs to be handled with appropriately. From the output, we can see that there is no duplicated data. So, next we can begin to check for the percentage of missing data in our dataset, so that we can begin uncovering any patterns related to the missingness.
sum(duplicated(train_data))
## [1] 0
We conduct a quick NHANES check to see which type of data mechanism it could possibly be. To do this, we can check the missing data mechanism through summary statistics. We do so through looking at the overall missingness per variable, which allows us to identify why the data is missing. From the summary statistic, we are able to see that the missing data is likely MAR (Missing At Random) or MCAR (Missing Completely At Random), but not MNAR (Missing Not At Random). This is due to MFR containing the highest number of NAs (8.2%), which tells us that missingness is not uniform across variables and that most other variables have < 2% missingness.
| Variable | Percent_NA |
|---|---|
| MFR | 8.2 |
| Filler.Speed | 2.2 |
| Fill.Ounces | 1.5 |
| PC.Volume | 1.5 |
| PSC.CO2 | 1.5 |
| PSC | 1.3 |
| Carb.Pressure1 | 1.2 |
| Hyd.Pressure4 | 1.2 |
| Carb.Pressure | 1.1 |
| Carb.Temp | 1.0 |
| PSC.Fill | 0.9 |
| Fill.Pressure | 0.9 |
| Filler.Level | 0.8 |
| Hyd.Pressure2 | 0.6 |
| Hyd.Pressure3 | 0.6 |
| Temperature | 0.5 |
| Oxygen.Filler | 0.5 |
| Pressure.Setpoint | 0.5 |
| Carb.Volume | 0.4 |
| Hyd.Pressure1 | 0.4 |
| Alch.Rel | 0.4 |
| Carb.Rel | 0.4 |
| Usage.cont | 0.2 |
| PH | 0.2 |
| Mnf.Flow | 0.1 |
| Carb.Flow | 0.1 |
| Bowl.Setpoint | 0.1 |
| Brand.Code | 0.0 |
| Density | 0.0 |
| Balling | 0.0 |
| Pressure.Vacuum | 0.0 |
| Air.Pressurer | 0.0 |
| Balling.Lvl | 0.0 |
To validate that the data mechanism is MAR, we can check if MFR against the mean temperature. From the summary, we can see that when MFR is not missing, the mean temperature is lower than when the MFR is missing. This tells us that there’s around a difference of 1.88 degrees Fahrenheit higher mean Temperature when MFR is missing, so MFR does correlates with higher temperature.
| MFR_missing | mean_Temp | n |
|---|---|---|
| FALSE | 65.82 | 2359 |
| TRUE | 67.70 | 212 |
Additionally, to validate, we can run a Welch Two Sample t-test to check if it is a significant difference. By rejecting the Null Hypothesis that the null values are MCAR, we accept the alternative that the data mechanism is MAR. Telling us that the missingness has a pattern.
t.test(Temperature ~ MFR_missing, data = train_data)
##
## Welch Two Sample t-test
##
## data: Temperature by MFR_missing
## t = -9.7049, df = 202.45, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -2.254130 -1.492855
## sample estimates:
## mean in group FALSE mean in group TRUE
## 65.82247 67.69596
The vis_miss function from the naniar library visualizes any patterns or relationships that may need to be preserved. In the plots below, it confirms our summary statistics as we observed that around 1% of the data contains NAs or missing data. 8% of the missing data being from MFR. Other predictors have missingness < 2.5%.
From our analysis of the missingness in the dataset, we can conclude that the data mechanism is MAR. Since it is MAR, it is suggested that handle the missing data through multiple imputation such as missForest, which is able to iteratively predicting missing values using random forest models. MissForest works well in our case due to working with various kinds of continuous and categorical variables, creating a single completed data set by iteratively predicting missing values.
In the histograms below, we observe that predictors have varied distributions as some are normal, some are bi-modal, some are skewed. Additionally, predictors have different units and scales (e.g., Carb.Flow in the 1000s, PSC in decimals of 0.1). For distance-based models, this scale difference would incorrectly weight predictors with larger numbers as more important. So, we will apply standardization to all numeric predictors to ensure equal contribution.
Additionally, we check for outliers in the predictors through the box plot below.
It is notable that Fill.Ounces, PC.Volume, PSC, Carb.Pressure1, Fill.Pressure, Temperature, MFR, Oxygen.Filler, Air.Pressurer, Filler.Speed contain a large number of outliers mostly skewed, so we will check if these variables have real impact on the target variable PH.
Upon closer inspection, the predictor variables from the scatter plot below reveals many things in relation to the PH, but first, let’s take a look at the predictors Temperature, Air.Pressurer and Filler.Speed.
In the individual scatter plot of Temperature, Air.Pressurer and Filler.Speed, they show that they act more like a factor/category than numeric. For these predictors, it could possibly be controlled by a discrete machine or factory setting. For Filler.Speed, the outliers scattered can be presumed to be from the machine alternating between 1000 and 4000. Since these predictors form vertical bands, it is suggested that they don’t have high correlation to the PH. However, we can create a correlation table between these predictors and our target variable to verify.
By creating a correlation table, we can see that Air.Pressure and Filler.Speed contain p values > 0.05, signifying no meaningful linear relationship with PH. Temperature has a weak negative correlation to PH with (r = -0.158). While the correlation table shows no meaningful linear relationship, it is not recommended to remove these predictors based on p-values alone, as it disregards non-linear patterns. Additionally, predictors can have meaningful interactions with other predictors to predict PH well. This correlation table is just to test the linear relationships the predictors have on PH and to inform the outlier detection process.
## Variable Correlation P_Value Significant
## cor Temperature -0.158 8.592733e-16 Yes
## cor1 Air.Pressurer -0.014 4.891116e-01 No
## cor2 Filler.Speed -0.025 2.183784e-01 No
Afterwards, we can begin to do more analysis of the predictors left: Temperature, Fill.Ounces, PC.Volume, PSC, Carb.Pressure1, Oxygen.Filler, MFR and Fill.Pressure. We will use the IQR rule for detecting outliers outside the IQR range (Q1 - 1.5 * IQR AND Q3 + 1.5 * IQR).
From the graph, we see:
Fill.Ounces, PC.Volume, PSC,
Carb.Pressure1 and Fill.Pressure have outliers
that aren’t extremely far from the IQR.Oxygen.Filler, MFR and
Temperature suggests extreme outliers.From our previous observations, we note that Temperature acts more like a factor/category than numeric predictor.
The above scatterplot has overplotting issues due to the data points clustering in a cloud-like formation, so we decided on randomly sampling 500 rows of the dataset to reduce this issue. From the randomly sampled scatter plot of the predictors, we can see that:
MFR and Fill.Pressure exhibits
factor/category likeness, rather than numeric likeness. Distribution
across the PH regression line suggestions no to weak correlation.Fill.Ounces, PC.Volume, PSC,
Carb.Pressure1 and Oxygen.Filler points on the
scatterplot exhibit weak/no relationship to the regression line of
PH.We can additionally check MFR and Fill.Pressure for their correlation to PH as they exhibit vertical clustering. From the table, we can see that MFR has a p values > 0.05, signifying no meaningful linear relationship with PH. While Fill.Pressure’s p-value is < 0.05 and signifies linear correlation to PH.
## Variable Correlation P_Value Significant
## cor MFR -0.009 6.721382e-01 No
## cor1 Fill.Pressure -0.213 1.199531e-27 Yes
From analyzing the predictors with seemingly large number of outliers or extreme outliers, it should be noted that none of the tested predictors have a notable significance to PH. Additionally, predictors such as MFR, Fill.Pressure, Temperature, Air.Pressure and Filler.Speed exhibits factor/category likeness, rather than numeric likeness and their range outliers most likely stem from the machine or factory setting.
Afterwards, we can check if the categorical variable Brand.Code has any significant predictive effects on the target variable PH. This is to inform us whether we should keep our categorical variable or not. Through the boxplot, we can see that each brand (including NAs) have varying median lines, indicating that they are different types of beverages that potentially come from various manufacturing processes. Interestingly, the NAs Brand.Code distribution ties closely to the distribution of all over brand.codes.
Let’s confirm that each brand has a statistically different PH distribution from every other brand through a mann-Whitney U test. As shown below, we can see that each pairwise comparison of Brand.Code has a statistically significant pH value (all p-values < 0.05 after Bonferroni correction), indicting that the brands represent different beverage types or formulations, not only a minor change/variation.
#mann-Whitney U test pairwise comparisons, labeled brand.code nas as missing
ph_by_brand <- train_data %>%
mutate(Brand.Code = forcats::fct_na_value_to_level(Brand.Code, "Missing"))
#https://www.statology.org/bonferroni-correction/
#pairwise Mann-Whitney U test with bonferroni
#in Bonferroni correction each raw p-value is multiplied by ( m ) (compare each p-value to (\alpha/m)).
#ensures strong control of the family-wise error rate.
pairwise_results <- pairwise.wilcox.test(
ph_by_brand$PH,
ph_by_brand$Brand.Code,
p.adjust.method = "bonferroni",
paired = FALSE
)
print(pairwise_results)
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: ph_by_brand$PH and ph_by_brand$Brand.Code
##
## A B C
## A 1.00000 - - -
## B 0.00016 2.0e-07 - -
## C 0.00165 3.6e-09 < 2e-16 -
## D 4.2e-10 < 2e-16 0.00058 < 2e-16
##
## P value adjustment method: bonferroni
From an ANOVA test of the Brand.Code to see the effects of Brand.Code on PH, we can see that p < 2e-16, signifying it as highly significant. Additionally, Brand.Code explains 11.7% of the variance in PH. Because of this, we decided to keep Brand.Code.
aov_result <- aov(PH ~ Brand.Code, data = train_data)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brand.Code 4 8.90 2.2258 84.52 <2e-16 ***
## Residuals 2562 67.47 0.0263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 4 observations deleted due to missingness
ss <- summary(aov_result)[[1]]$"Sum Sq"
r_sq <- ss[1] / sum(ss)
#variance by brand.code
round(r_sq * 100, 2)
## [1] 11.66
Before we begin to check the correlation of predictors, it is important to exclude PH from the training data so that it doesn’t affect any of the predictor correlations. After doing so, we can start analyzing the relationships of the variables. Through the correlation plot below, we can see that variables such as Hy.Pressure3, Filler.Level, Filler.Speed, Density, Balling, Alch.Rel and Carb.Rel have correlations > 85%. Before we remove the variables, however, we can use the function findCorrelation() to match our correlation plot.
From the findCorrelation function, we can see that they match the analysis of the correlation plot, with the exception of Carb.Rel. This is likely due to carb.rel having an average correlation that is less than 85% with the other predictors. As such, it will not be a candidate for removal. However, the rest of the predictors: “Balling”, “Hyd.Pressure3”, “Alch.Rel”, “Balling.Lvl”, “Density”, “Filler.Level” and “Filler.Speed” is recommended.
## [1] "Balling" "Hyd.Pressure3" "Balling.Lvl" "Alch.Rel"
## [5] "Density" "Filler.Level" "Filler.Speed"
Balling, Hyd.Pressure3,
Alch.Rel, Balling.Lvl,
Density,Filler.Level, and
Filler.Speed due to high correlation with other
predictors.MFR, Fill.Pressure, Temperature,
Air.Pressure, Filler.Speed) show non-normal
distributions and non/weak linear correlations with PH. In addition to
exhibiting patterns as that of an ordinal categorical variable. This
approach could be less sensitive to outliers and can capture non-linear
patterns without requiring transformations.Brand.Code, missForest allows for
working with various kinds of continuous and categorical variables,
creating a single completed data set by iteratively predicting missing
values.Note: Brand.Code most likely represents different beverage types or formulations, not only a minor change/variation.
Before we impute missing values, handle outliers, or transform features, we must split our dataset. So, we partitioned into training and validation subsets by randomly selecting 80% for training and leaving the remaining 20% set aside for testing. If we calculate global metrics like means for imputation or correlation thresholds on the entire dataset, information from the validation set will leak into the training process. By splitting first, we ensure our preprocessing rules are learned only from the training data and then blindly applied to the validation data, giving us a true measure of how our model will perform in the real world.
url_train <- "https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv"
raw_data <- read.csv(url_train, stringsAsFactors = FALSE)
# Drop rows where target is missing (standard)
raw_data_clean <- raw_data %>% filter(!is.na(PH))
# Split
set.seed(42)
train_index <- createDataPartition(raw_data_clean$PH, p = 0.8, list = FALSE)
train_set <- raw_data_clean[train_index, ]
val_set <- raw_data_clean[-train_index, ]
#checking size of train and validation set
#nrow(train_set)
#nrow(val_set)
missForest is running a full machine learning prediction process for every single column that has an NA. It treats the missing data as the target to be predicted, and uses all the non-missing data as the clues.
set.seed(42)
train_imputation <- missForest(train_predictors,
maxiter = 5,
ntree = 100,
verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1677212 0.1346252
## difference(s): 0.0006085918 0.02481752
## time: 1.69 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1637871 0.1371749
## difference(s): 6.235666e-06 0.00973236
## time: 2.09 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1611763 0.1336053
## difference(s): 5.843416e-06 0.009245742
## time: 2.31 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1592056 0.1341152
## difference(s): 5.449064e-06 0.01021898
## time: 2.05 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.159595 0.1315655
## difference(s): 3.784334e-06 0.01021898
## time: 2.05 seconds
# Extract the clean data and reattach our target variable
train_set_clean <- train_imputation$ximp
train_set_clean$PH <- train_set_reduced$PH
The future machine learning model needs clean, complete data to make its predictions. If we leave NAs in the validation set, the model will crash when we try to test it later. We will use the exact same missForest algorithm we did on the training set to fill in the blanks here.
## missForest iteration 1 in progress...done!
## estimated error(s): 0.2261653 0.1769547
## difference(s): 0.0001755827 0.0234375
## time: 0.54 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.2213659 0.1831276
## difference(s): 6.491444e-06 0.005859375
## time: 0.51 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.2282613 0.1666667
## difference(s): 3.899296e-06 0.005859375
## time: 0.53 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.2320909 0.1748971
## difference(s): 2.175098e-06 0.00390625
## time: 0.52 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.2269774 0.1687243
## difference(s): 3.023564e-06 0.0078125
## time: 0.55 seconds
## Total NAs remaining in the validation set: 0
Our EDA shows predictors have different units and scales. If we use a distance-based machine learning model (like K-Nearest Neighbors, Neural Networks, or Support Vector Machines), the algorithm will look at those numbers and assume that Pressure is the most important variable simply because the numbers are bigger.
To fix this, we standardizes the data:
We must calculate the average and standard deviation ONLY on the Training Set, and then apply that exact same math to the Validation Set. If we calculate the validation set’s own average, we risk leaking future data into our model.
# Identify ONLY the numeric predictor columns
# (We exclude our target 'PH', and 'where(is.numeric)' automatically excludes 'Brand.Code')
numeric_cols <- train_set_clean %>%
select(where(is.numeric), -PH) %>%
names()
# Build the Scaling Math based only on the Training Set
scale_model <- preProcess(train_set_clean[, numeric_cols],
method = c("center", "scale"))
# Apply the scaling math to the Training Set
train_set_final <- predict(scale_model, train_set_clean)
# Apply the exact same scaling math to the Validation Set
val_set_final <- predict(scale_model, val_set_clean)
#Check the summary of Fill.Ounces to see the new standardized scale
summary(train_set_final$Fill.Ounces)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.98555 -0.63492 -0.01155 0.00000 0.61183 4.04038
Before training the other models, setting up a baseline Linear Regression provides a critical benchmark. If a highly complex algorithm only marginally improves the RMSE (Root Mean Squared Error) over a basic linear model, the computational cost might not be justified.
To ensure all models are evaluated fairly without touching the isolated validation set, K-Fold Cross-Validation (5 folds) is implemented during the training phase.
# Establish the Cross-Validation Framework
set.seed(42)
cv_control <- trainControl(method = "cv",
number = 5,
savePredictions = "final")
# Training Baseline Linear Regression Model
set.seed(42)
lm_model <- train(PH ~ .,
data = train_set_final,
method = "lm",
trControl = cv_control)
| RMSE | Rsquared | MAE |
|---|---|---|
| 0.1377 | 0.3625 | 0.1077 |
In our baseline model, we can see its performance:
RMSE: 0.1377Rsquared: 0.36MAE: 0.1077Though our linear regression model was fast to train, the baseline model only explains 36% of the variance in PH. The typical prediction error is ±0.1377 PH units. For beverage manufacturing quality control, these error margins are unacceptable as they can disrupt processes, damage equipment, and compromise product quality or safety.
As such, we will only be using this as comparison to the other models we will be evaluating.
Our diagnostic plots reveal that linear regression violates multiple statistical assumptions, confirming it is inappropriate for this data:
Using linear regression would lead to unreliable PH predictions, equipment damage, and safety and quality control issues.
set.seed(42)
rf_model <- train(PH ~ .,
data = train_set_final,
method = "rf",
trControl = cv_control,
tuneLength = 3, # Tests 3 different 'mtry' tuning parameters
importance = TRUE)
print(rf_model)
## Random Forest
##
## 2055 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1643, 1644, 1645, 1644, 1644
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1185277 0.5693689 0.09036701
## 14 0.1072305 0.6233415 0.07870450
## 27 0.1076169 0.6138156 0.07774770
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
| RMSE | Rsquared | MAE |
|---|---|---|
| 0.1072 | 0.6233 | 0.0787 |
In our random forest model, we can see its performance:
RMSE: 0.1072Rsquared: 0.62MAE: 0.0777Our random forest model explains ~62% of variance in PH. The typical prediction error is ±0.1072 PH units.
From our diagnostics of the Random Forest model below, we can see:
# Get predictions
rf_predictions <- predict(rf_model, newdata = train_set_final)
# Create a temporary dataframe for plotting only
plot_df <- data.frame(
PH = train_set_final$PH,
predicted = rf_predictions
)
# Predicted vs Actual PH
p1 <- ggplot(plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "Predicted v. Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal()
# Residuals vs Predicted PH
plot_df$residuals <- plot_df$PH - plot_df$predicted
p2 <- ggplot(plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "Residuals v. Predicted PH",
x = "Predicted PH", y = "Residuals") +
theme_minimal()
p1 + p2
# "Training KNN Model
set.seed(42)
knn_model <- train(PH ~ .,
data = train_set_final,
method = "knn",
tuneGrid = expand.grid(k = seq(3, 21, by = 2)),
trControl = cv_control,
metric = "RMSE")
| k | RMSE | Rsquared | MAE |
|---|---|---|---|
| 11 | 0.1292 | 0.4503 | 0.0973 |
In our K-Nearest Neighbor model, we can see its performance:
RMSE: 0.1292Rsquared: 0.45MAE: 0.0947Our K-Nearest Neighbor model explains ~45% of variance in PH. The typical prediction error is ±0.1292 PH units. We will make comparisons to the baseline model, after all models have been evaluated.
KNN is an improvement over the linear regression model.
From our diagnostics of the K-Nearest Neighbor Model below, we can see:
X_val_fixed <- val_set_final %>%
select(-PH) %>%
select(all_of(names(train_set_final %>% select(-PH))))
# Add back PH
val_set_fixed <- X_val_fixed
val_set_fixed$PH <- val_set_final$PH
cubist_model <- train(PH ~ .,
data = train_set_final,
method = "cubist",
trControl = cv_control,
tuneLength = 5)
| RMSE | Rsquared | MAE |
|---|---|---|
| 0.1084 | 0.6072 | 0.0789 |
In our Cubist model, we can see its performance:
RMSE:0.1084Rsquared: 0.61MAE: 0.0789Our Cubist model explains ~61% of variance in PH. The typical prediction error is ±0.1084 PH units.
From our diagnostics of the Cubist model below, we can see:
treebag_model <- train(PH ~ .,
data = train_set_final,
method = "treebag",
trControl = cv_control,
nbagg = 100)
print(treebag_model)
## Bagged CART
##
## 2055 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1643, 1645, 1643, 1645, 1644
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1264163 0.4665561 0.09760613
| RMSE | Rsquared | MAE |
|---|---|---|
| 0.1264 | 0.4666 | 0.0976 |
In our Treebag model, we can see its performance:
RMSE:0.1264Rsquared: 0.47MAE: 0.0976Our Treebag model explains ~47% of variance in PH. The typical prediction error is ±0.1264 PH units.
From our diagnostics of the Treebag model below, we can see:
| RMSE | Rsquared | MAE |
|---|---|---|
| 0.1193 | 0.5219 | 0.0906 |
In our XGBoost model, we can see its performance:
RMSE:0.1193Rsquared: 0.52MAE: 0.0905Our Treebag model explains ~52% of variance in PH. The typical prediction error is ±0.12 PH units.
From our diagnostics of the XGBoost model below, we can see:
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| Random Forest | 0.1072 | 0.6233 | 0.0777 |
| Cubist | 0.1084 | 0.6072 | 0.0789 |
| XGBoost | 0.1193 | 0.5219 | 0.0905 |
| Tree Bag | 0.1264 | 0.4666 | 0.0976 |
| KNN | 0.1292 | 0.4503 | 0.0947 |
| Linear Regression | 0.1377 | 0.3625 | 0.1077 |
From our model comparison summary above, there are differences between models.
Due to Random Forest containing the highest Rsquared and lowest RSME and MAE, it is recommended for production deployment. It explains 62% of PH variance with typical prediction errors of ±0.11 PH units. While further improvement is possible, this model provides the most reliable predictions for quality control in beverage manufacturing.
With Random Forest selected as our final model, we examine predictor importance to understand which variables drive PH predictions. This reveals the key sensors and process measurements that most influence beverage quality.
From the plot of the top 15 predictors of PH above, it reveals clear rankings of influence on PH:
Mnf.Flow: Is the most dominant predictor of PH.
Manufacturing flow rate has the strongest influence on PH more than any
other process measurement.Brand.CodeC: Brand C has a distinct chemical profile
compared to other brands.Usage.cont: Continuous usage measurements moderately
influence PH.Pressure.Vacuum: Vacuum pressure conditions affect
acidity levels.The remaining top 15 predictors show modest but meaningful contributions to PH prediction and not primary drivers.
We now test the Random Forest model on the 20% validation set.
# Final validation predictions
final_predictions <- predict(rf_model, newdata = val_set_final)
# Calculate metrics
final_score <- postResample(pred = final_predictions, obs = val_set_final$PH)
final_results <- data.frame(
Value = round(c(final_score[1], final_score[2], final_score[3]), 4)
)
final_results %>%
kable(caption = "Random Forest Validation Performance") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
| Value | |
|---|---|
| RMSE | 0.0992 |
| Rsquared | 0.6804 |
| MAE | 0.0744 |
## removed variable(s) 21 due to the missingness of all entries
After predicting the test data with our random forest model, we can create some plots to visualize typical factory predictions.
In our histogram of our predicted PH below, we can anticipate:
hist_plot <- ggplot(submission_file, aes(x = Predicted_PH)) +
geom_histogram(fill = "#b19cd9", color = "white", bins = 20) +
theme_minimal() +
labs(title = "What the Factory will produce today (Predicted PH)",
x = "Predicted PH Level",
y = "Number of Batches")
print(hist_plot)
In the graph below, we can determine whether the predictions look normal:
train_ph <- data.frame(PH_Value = train_set_final$PH, Data_Type = "1. Historical (Training)")
test_ph <- data.frame(PH_Value = submission_file$Predicted_PH, Data_Type = "2. Predicted (Test)")
combined_ph <- rbind(train_ph, test_ph)
overlap_plot <- ggplot(combined_ph, aes(x = PH_Value, fill = Data_Type)) +
geom_density(alpha = 0.6) +
theme_minimal() +
scale_fill_manual(values = c("1. Historical (Training)" = "darkgray",
"2. Predicted (Test)" = "darkorange")) +
labs(title = "Prediction v. Reality",
x = "PH Value",
y = "Density",
fill = "Data Source")
print(overlap_plot)
Random Forest was selected as the final model, achieving the highest
R² and lowest RMSE among all candidates. Its top predictors
Mnf.Flow, Brand.CodeC,
Usage.cont, and Pressure.Vacuum represent the
key process variables governing beverage pH.
Now that we have our final predictions, the ABC Beverage data science team has identified several next steps:
The appendix includes all the code included to build this technical report.
# Load libraries
library(tidyverse)
library(mice) # For advanced imputation
library(caret) # For data splitting and pre-processing
#library(rcompanion)
library(ggplot2)
library(UpSetR) # Upset plot
library(corrplot) # corelation plot
library(tidyr)
library(patchwork)
library(kableExtra)
library(naniar)
library(psych)
library(dplyr)
library(rstatix)
library(mice)
library(dplyr)
library(visdat)
library(missForest)
library(xgboost)
library(randomForest)
library(fastDummies)
library(patchwork)
# Load the datasets
train_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv", stringsAsFactors = FALSE)
test_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentEvaluation.csv", stringsAsFactors = FALSE)
cat_summary <- train_data %>%
count(Brand.Code) %>%
mutate(
Percentage = round(n / sum(n) * 100, 2)
) %>%
rename(
"Brand Code" = Brand.Code,
"Frequency" = n,
"Percent" = Percentage
)
cat_summary %>%
kable(caption = "Descriptive Stats Brand.Code") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
numeric_vars <- train_data %>% select(where(is.numeric))
num_summary <- describe(numeric_vars) %>%
rownames_to_column("Variable") %>%
mutate(NA_Count = sapply(numeric_vars, function(x) sum(is.na(x)))) %>%
select(Variable, n, mean, sd, median, min, max, NA_Count) %>%
mutate(across(where(is.numeric), ~ round(., 2))) %>%
rename(
"Count" = n,
"Mean" = mean,
"Std Dev" = sd,
"Median" = median,
"Min" = min,
"Max" = max,
"NA's" = NA_Count
)
num_summary %>%
kable(caption = "Descriptive Stats Numeric Predictors") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
#summary(train_data)
sum(duplicated(train_data))
na_summary <- train_data %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
tidyr::pivot_longer(everything(), names_to = "Variable", values_to = "Percent_NA") %>%
mutate(Percent_NA = round(Percent_NA, 1)) %>%
arrange(desc(Percent_NA))
na_summary %>%
kable(caption = "Missingness per Variable") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
train_data <- train_data %>%
mutate(MFR_missing = is.na(MFR))
mar_check <- train_data %>%
group_by(MFR_missing) %>%
summarise(mean_Temp = round(mean(Temperature, na.rm = TRUE), 2),
n = n())
mar_check %>%
kable(caption = "Mean Temperature by MFR Missing Status") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
t.test(Temperature ~ MFR_missing, data = train_data)
vis_miss(train_data) +
coord_flip()
#histogram
train_data %>%
select(where(is.numeric)) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "#b19cd9", alpha = 0.7) +
facet_wrap(~name, scales = "free", ncol = 4) +
theme_minimal()
all_numeric <- train_data %>%
select(where(is.numeric), -PH) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value")
var_list <- unique(all_numeric$variable)
plots <- list()
for(i in seq_along(var_list)) {
plots[[i]] <- ggplot(all_numeric %>% filter(variable == var_list[i]),
aes(x = variable, y = value)) +
geom_boxplot(fill = "#b19cd9", color = "#483d8b", alpha = 0.7,
outlier.color = "darkred", outlier.size = 0.8) +
labs(title = var_list[i], x = "", y = "") +
theme_minimal(base_size = 8) +
theme(plot.title = element_text(size = 9, hjust = 0.5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
}
wrap_plots(plots, ncol = 5)
all_numeric <- train_data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value")
#predictors with high num of outliers
selected_vars <- c("Fill.Ounces", "PC.Volume", "PSC", "Carb.Pressure1",
"Fill.Pressure", "Temperature", "MFR", "Oxygen.Filler", "Air.Pressurer", "Filler.Speed")
plots <- list()
for(i in seq_along(selected_vars)) {
var <- selected_vars[i]
plot_data <- all_numeric %>% filter(variable == var)
plots[[i]] <- ggplot(plot_data, aes(x = variable, y = value)) +
geom_boxplot(fill = "#b19cd9", color = "#483d8b", alpha = 0.7,
outlier.color = "darkred", outlier.size = 0.8) +
labs(title = var, x = "", y = "") +
theme_minimal(base_size = 8) +
theme(plot.title = element_text(size = 9, hjust = 0.5),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
}
wrap_plots(plots, ncol = 3)
for(i in seq_along(selected_vars)) {
var <- selected_vars[i]
plots[[i]] <- ggplot(train_data, aes(x = .data[[var]], y = PH)) +
geom_point(alpha = 0.4, size = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
theme_minimal(base_size = 8) +
labs(title = var, x = "", y = "PH") +
theme(plot.title = element_text(size = 9, hjust = 0.5))
}
wrap_plots(plots, ncol = 3)
plot(train_data$Temperature, train_data$PH,
main = "Temperature vs PH",
xlab = "Temperature", ylab = "PH")
plot(train_data$Air.Pressurer, train_data$PH,
main = "Air.Pressurer vs PH",
xlab = "Air.Pressurer", ylab = "PH")
plot(train_data$Filler.Speed, train_data$PH,
main = "Filler.Speed vs PH",
xlab = "Filler.Speed", ylab = "PH")
vars <- c("Temperature", "Air.Pressurer", "Filler.Speed")
cor_table <- data.frame()
for(var in vars) {
test <- cor.test(train_data[[var]], train_data$PH, use = "complete.obs")
cor_table <- rbind(cor_table, data.frame(
Variable = var,
Correlation = round(test$estimate, 3),
P_Value = test$p.value,
Significant = ifelse(test$p.value < 0.05, "Yes", "No")
))
}
cor_table$P_Value <- format(cor_table$P_Value, scientific = TRUE)
print(cor_table)
vars <- c("Fill.Ounces", "PC.Volume", "PSC", "Carb.Pressure1", "Oxygen.Filler", "MFR", "Fill.Pressure", "Temperature")
plots <- list()
for(var in vars) {
#iqr
Q1 <- quantile(train_data[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(train_data[[var]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
#flagging the outliers
plot_data <- train_data %>%
mutate(is_outlier = train_data[[var]] < lower | train_data[[var]] > upper,
outlier_color = ifelse(is_outlier, "Outlier", "Normal"))
p <- ggplot(plot_data, aes(x = .data[[var]], y = PH, color = outlier_color)) +
geom_point(alpha = 0.6, size = 1) +
geom_smooth(method = "lm", color = "#000000", se = FALSE, size = 0.8) +
scale_color_manual(values = c("Normal" = "#b19cd9", "Outlier" = "darkred")) +
theme_minimal(base_size = 8) +
theme(legend.position = "none",
plot.title = element_text(size = 9, hjust = 0.5)) +
labs(title = var, x = "", y = "PH")
plots[[var]] <- p
}
wrap_plots(plots, ncol = 4)
set.seed(5555)
sample_data <- train_data %>% sample_n(500)
plots <- list()
for(var in vars) {
Q1 <- quantile(train_data[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(train_data[[var]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
plot_data <- sample_data %>%
mutate(is_outlier = sample_data[[var]] < lower | sample_data[[var]] > upper,
outlier_color = ifelse(is_outlier, "Outlier", "Normal"))
p <- ggplot(plot_data, aes(x = .data[[var]], y = PH, color = outlier_color)) +
geom_point(alpha = 0.6, size = 1) +
geom_smooth(method = "lm", color = "darkred", se = FALSE, size = 1) +
scale_color_manual(values = c("Normal" = "#b19cd9", "Outlier" = "darkred")) +
theme_minimal(base_size = 8) +
theme(legend.position = "none",
plot.title = element_text(size = 9, hjust = 0.5)) +
labs(title = var, x = "", y = "PH")
plots[[var]] <- p
}
wrap_plots(plots, ncol = 4)
vars <- c("MFR", "Fill.Pressure")
cor_table <- data.frame()
for(var in vars) {
test <- cor.test(train_data[[var]], train_data$PH, use = "complete.obs")
cor_table <- rbind(cor_table, data.frame(
Variable = var,
Correlation = round(test$estimate, 3),
P_Value = test$p.value,
Significant = ifelse(test$p.value < 0.05, "Yes", "No")
))
}
cor_table$P_Value <- format(cor_table$P_Value, scientific = TRUE)
print(cor_table)
boxplot(PH ~ Brand.Code, data = train_data,
las = 2, xlab = "Brand Code", ylab = "PH",
main = "PH Distribution by Brand Code")
#mann-Whitney U test pairwise comparisons, labeled brand.code nas as missing
ph_by_brand <- train_data %>%
mutate(Brand.Code = forcats::fct_na_value_to_level(Brand.Code, "Missing"))
#https://www.statology.org/bonferroni-correction/
#pairwise Mann-Whitney U test with bonferroni
#in Bonferroni correction each raw p-value is multiplied by ( m ) (compare each p-value to (\alpha/m)).
#ensures strong control of the family-wise error rate.
pairwise_results <- pairwise.wilcox.test(
ph_by_brand$PH,
ph_by_brand$Brand.Code,
p.adjust.method = "bonferroni",
paired = FALSE
)
print(pairwise_results)
aov_result <- aov(PH ~ Brand.Code, data = train_data)
summary(aov_result)
ss <- summary(aov_result)[[1]]$"Sum Sq"
r_sq <- ss[1] / sum(ss)
#variance by brand.code
round(r_sq * 100, 2)
#excludingPH from numeric data
numeric_data <- train_data %>%
select(where(is.numeric), -PH)
cor_matrix <- cor(numeric_data, use = "complete.obs")
cor_matrix_filtered <- cor_matrix
cor_matrix_filtered[abs(cor_matrix_filtered) < 0.85] <- NA
corrplot(cor_matrix_filtered,
method = "color",
type = "upper",
tl.cex = 0.7,
na.label = " ",
col = colorRampPalette(c("darkred", "white", "#483d8b"))(200),
title = "Correlations |r| > 0.85",
mar = c(0,0,2,0))
high_corr <- findCorrelation(cor_matrix, cutoff = 0.85, names = TRUE, exact = TRUE)
high_corr
high_cor_pairs <- which(abs(cor_matrix) > 0.85 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_cor_pairs <- data.frame(
Var1 = rownames(cor_matrix)[high_cor_pairs[,1]],
Var2 = colnames(cor_matrix)[high_cor_pairs[,2]],
Corr = cor_matrix[high_cor_pairs]
) %>% filter(Var1 != Var2) %>% arrange(desc(abs(Corr)))
#(high_cor_pairs)
url_train <- "https://raw.githubusercontent.com/uzmabb182/Data_624_Team_Project_2/refs/heads/main/StudentData.csv"
raw_data <- read.csv(url_train, stringsAsFactors = FALSE)
# Drop rows where target is missing (standard)
raw_data_clean <- raw_data %>% filter(!is.na(PH))
# Split
set.seed(42)
train_index <- createDataPartition(raw_data_clean$PH, p = 0.8, list = FALSE)
train_set <- raw_data_clean[train_index, ]
val_set <- raw_data_clean[-train_index, ]
#checking size of train and validation set
#nrow(train_set)
#nrow(val_set)
# Isolate only the numeric predictor variables from the training set
# (We exclude PH because it is the target, and Brand.Code because it is categorical)
numeric_train <- train_set %>% select(where(is.numeric), -PH)
# Calculate the correlation matrix
# We use pairwise.complete.obs to safely handle the NAs currently in the data
cor_matrix <- cor(numeric_train, use = "pairwise.complete.obs")
# Find predictors with an absolute correlation greater than 0.85
high_cor_indices <- findCorrelation(cor_matrix, cutoff = 0.85, exact = TRUE)
high_cor_vars <- colnames(numeric_train)[high_cor_indices]
# We will also be removing Alch.Rel as per our EDA analysis
# high_cor_vars <- c(high_cor_vars, "Alch.Rel")
# Vars to remove with high correlation
print(high_cor_vars)
# Lock in the Brand.Code fix for BOTH sets
# (Converting empty strings to true NAs, and making sure they are factors)
train_set_reduced <- train_set_reduced %>%
mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
mutate(Brand.Code = as.factor(Brand.Code))
val_set_reduced <- val_set_reduced %>%
mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
mutate(Brand.Code = as.factor(Brand.Code))
# Separate the target variable (PH) so the imputer cannot cheat, preventing data leakage
train_predictors <- train_set_reduced %>% select(-PH)
val_predictors <- val_set_reduced %>% select(-PH)
set.seed(42)
train_imputation <- missForest(train_predictors,
maxiter = 5,
ntree = 100,
verbose = TRUE)
# Extract the clean data and reattach our target variable
train_set_clean <- train_imputation$ximp
train_set_clean$PH <- train_set_reduced$PH
# See total NAs after imputation
total_nas <- sum(is.na(train_set_clean))
#total_nas
set.seed(42)
# Execute missForest on the validation predictors
val_imputation <- missForest(val_predictors,
maxiter = 5,
ntree = 100,
verbose = TRUE)
# Extract the clean data
val_set_clean <- val_imputation$ximp
# Put the target variable (PH) safely back into the dataset
val_set_clean$PH <- val_set_reduced$PH
# The Final Check: Verify zero NAs remain
total_val_nas <- sum(is.na(val_set_clean))
cat("Total NAs remaining in the validation set:", total_val_nas, "\n")
# Identify ONLY the numeric predictor columns
# (We exclude our target 'PH', and 'where(is.numeric)' automatically excludes 'Brand.Code')
numeric_cols <- train_set_clean %>%
select(where(is.numeric), -PH) %>%
names()
# Build the Scaling Math based only on the Training Set
scale_model <- preProcess(train_set_clean[, numeric_cols],
method = c("center", "scale"))
# Apply the scaling math to the Training Set
train_set_final <- predict(scale_model, train_set_clean)
# Apply the exact same scaling math to the Validation Set
val_set_final <- predict(scale_model, val_set_clean)
#Check the summary of Fill.Ounces to see the new standardized scale
summary(train_set_final$Fill.Ounces)
# Establish the Cross-Validation Framework
set.seed(42)
cv_control <- trainControl(method = "cv",
number = 5,
savePredictions = "final")
# Training Baseline Linear Regression Model
set.seed(42)
lm_model <- train(PH ~ .,
data = train_set_final,
method = "lm",
trControl = cv_control)
lm_metrics <- lm_model$results %>%
select(RMSE, Rsquared, MAE) %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
)
lm_metrics %>%
kable(caption = "Linear Regression Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
add_header_above(c("Performance Metrics" = 3))
# Diagnostic plots for Linear Regression
par(mfrow = c(2, 2))
plot(lm_model$finalModel)
par(mfrow = c(1, 1))
set.seed(42)
rf_model <- train(PH ~ .,
data = train_set_final,
method = "rf",
trControl = cv_control,
tuneLength = 3, # Tests 3 different 'mtry' tuning parameters
importance = TRUE)
print(rf_model)
rf_metrics <- rf_model$results %>%
select(mtry, RMSE, Rsquared, MAE) %>%
arrange(RMSE)
best_rf <- rf_metrics %>% slice_min(RMSE)
best_rf %>%
select(RMSE, Rsquared, MAE) %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
) %>%
kable(caption = "Random Forest Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
# Get predictions
rf_predictions <- predict(rf_model, newdata = train_set_final)
# Create a temporary dataframe for plotting only
plot_df <- data.frame(
PH = train_set_final$PH,
predicted = rf_predictions
)
# Predicted vs Actual PH
p1 <- ggplot(plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "Predicted v. Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal()
# Residuals vs Predicted PH
plot_df$residuals <- plot_df$PH - plot_df$predicted
p2 <- ggplot(plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "Residuals v. Predicted PH",
x = "Predicted PH", y = "Residuals") +
theme_minimal()
p1 + p2
# "Training KNN Model
set.seed(42)
knn_model <- train(PH ~ .,
data = train_set_final,
method = "knn",
tuneGrid = expand.grid(k = seq(3, 21, by = 2)),
trControl = cv_control,
metric = "RMSE")
# KNN metrics
knn_metrics <- knn_model$results %>%
select(k, RMSE, Rsquared, MAE) %>%
arrange(RMSE)
best_knn <- knn_metrics %>% slice_min(RMSE)
best_knn %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
) %>%
kable(caption = "KNN Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
add_header_above(c("Best k" = 1, "Performance Metrics" = 3))
# Get KNN predictions (store separately)
knn_predictions <- predict(knn_model, newdata = train_set_final)
# Create temporary dataframe for plotting
knn_plot_df <- data.frame(
PH = train_set_final$PH,
predicted = knn_predictions
)
# Predicted vs Actual
p1 <- ggplot(knn_plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "KNN: Predicted vs Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal()
# Residuals vs Predicted
knn_plot_df$residuals <- knn_plot_df$PH - knn_plot_df$predicted
p2 <- ggplot(knn_plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9") +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "KNN: Residuals vs Predicted",
x = "Predicted PH", y = "Residuals") +
theme_minimal()
options(repr.plot.width = 12, repr.plot.height = 6)
p1 + p2
X_val_fixed <- val_set_final %>%
select(-PH) %>%
select(all_of(names(train_set_final %>% select(-PH))))
# Add back PH
val_set_fixed <- X_val_fixed
val_set_fixed$PH <- val_set_final$PH
cubist_model <- train(PH ~ .,
data = train_set_final,
method = "cubist",
trControl = cv_control,
tuneLength = 5)
cubist_model$results %>%
arrange(RMSE) %>%
slice_head(n = 1) %>%
select(RMSE, Rsquared, MAE) %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
) %>%
kable(caption = "Cubist Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
cubist_pred_train <- predict(cubist_model, newdata = train_set_final)
cubist_plot_df <- data.frame(
PH = train_set_final$PH,
predicted = cubist_pred_train
)
# Predicted vs Actual plot
p1 <- ggplot(cubist_plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "Cubist: Predicted vs Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal(base_size = 12)
# Residuals vs Predicted plot
cubist_plot_df$residuals <- cubist_plot_df$PH - cubist_plot_df$predicted
p2 <- ggplot(cubist_plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "Cubist: Residuals vs Predicted",
x = "Predicted PH", y = "Residuals") +
theme_minimal(base_size = 12)
p1 + p2
treebag_model <- train(PH ~ .,
data = train_set_final,
method = "treebag",
trControl = cv_control,
nbagg = 100)
print(treebag_model)
# Extract CV metrics
treebag_cv_metrics <- treebag_model$results %>%
select(RMSE, Rsquared, MAE) %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
)
treebag_cv_metrics %>%
kable(caption = "Tree Bag Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
treebag_pred_train <- predict(treebag_model, newdata = train_set_final)
treebag_plot_df <- data.frame(
PH = train_set_final$PH,
predicted = treebag_pred_train
)
# Predicted vs Actual plot
p1 <- ggplot(treebag_plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "Tree Bag: Predicted vs Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal(base_size = 12)
# Residuals vs Predicted plot
treebag_plot_df$residuals <- treebag_plot_df$PH - treebag_plot_df$predicted
p2 <- ggplot(treebag_plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "Tree Bag: Residuals vs Predicted",
x = "Predicted PH", y = "Residuals") +
theme_minimal(base_size = 12)
p1 + p2
# Ensure train_set_clean is clean
train_set_clean <- train_set_final %>%
select(-starts_with("rf_"), -starts_with("knn_"), -contains("predicted"), -contains("residuals"))
# One-hot encode Brand.Code
train_xgb <- train_set_clean %>%
dummy_cols(select_columns = "Brand.Code", remove_first = TRUE) %>%
select(-Brand.Code)
# Train XGBoost with warnings suppressed
set.seed(42)
xgb_model <- train(PH ~ .,
data = train_xgb,
method = "xgbTree",
trControl = cv_control,
tuneLength = 3,
verbose = FALSE)
# Extract CV metrics
xgb_model_metrics <- xgb_model$results %>%
arrange(RMSE) %>%
slice_head(n = 1) %>%
select(RMSE, Rsquared, MAE) %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
)
xgb_model_metrics %>%
kable(caption = "XGBoost Performance (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
xgb_pred_train <- predict(xgb_model, newdata = train_xgb)
xgb_plot_df <- data.frame(
PH = train_xgb$PH,
predicted = xgb_pred_train
)
# Predicted vs Actual plot
p1 <- ggplot(xgb_plot_df, aes(x = PH, y = predicted)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred", size = 1) +
coord_fixed() +
labs(title = "XGBoost: Predicted vs Actual PH",
x = "Actual PH", y = "Predicted PH") +
theme_minimal(base_size = 12)
# Residuals vs Predicted plot
xgb_plot_df$residuals <- xgb_plot_df$PH - xgb_plot_df$predicted
p2 <- ggplot(xgb_plot_df, aes(x = predicted, y = residuals)) +
geom_point(alpha = 0.5, color = "#b19cd9", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "darkred") +
labs(title = "XGBoost: Residuals vs Predicted",
x = "Predicted PH", y = "Residuals") +
theme_minimal(base_size = 12)
p1 + p2
model_results <- data.frame(
Model = c("Linear Regression", "Random Forest", "KNN", "Tree Bag", "XGBoost", "Cubist"),
RMSE = c(
min(lm_model$results$RMSE),
min(rf_model$results$RMSE),
min(knn_model$results$RMSE),
min(treebag_model$results$RMSE),
min(xgb_model$results$RMSE),
min(cubist_model$results$RMSE)
),
Rsquared = c(
max(lm_model$results$Rsquared),
max(rf_model$results$Rsquared),
max(knn_model$results$Rsquared),
max(treebag_model$results$Rsquared),
max(xgb_model$results$Rsquared),
max(cubist_model$results$Rsquared)
),
MAE = c(
min(lm_model$results$MAE),
min(rf_model$results$MAE),
min(knn_model$results$MAE),
min(treebag_model$results$MAE),
min(xgb_model$results$MAE),
min(cubist_model$results$MAE)
)
)
model_results <- model_results %>%
mutate(
RMSE = round(RMSE, 4),
Rsquared = round(Rsquared, 4),
MAE = round(MAE, 4)
) %>%
arrange(RMSE)
model_results %>%
kable(caption = "Model Performance Comparison (5-Fold CV)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white") %>%
column_spec(1, bold = TRUE) %>%
add_header_above(c(" " = 1, "Performance Metrics" = 3))
# Calculate the 'Importance' of each variable
# We scale it to 100 so the top sensor is always '100'
rf_importance <- varImp(rf_model, scale = TRUE)
# We only show the Top 15 so the chart stays clean and easy to read
plot(rf_importance,
top = 15,
main = "Top 15 Predictors of PH",
col = "#b19cd9")
# Final validation predictions
final_predictions <- predict(rf_model, newdata = val_set_final)
# Calculate metrics
final_score <- postResample(pred = final_predictions, obs = val_set_final$PH)
final_results <- data.frame(
Value = round(c(final_score[1], final_score[2], final_score[3]), 4)
)
final_results %>%
kable(caption = "Random Forest Validation Performance") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
row_spec(0, bold = TRUE, background = "#b19cd9", color = "white")
#Pre-processing test data
#Replace all underscores (_) with dots (.) in the column names
names(test_data) <- gsub("_", ".", names(test_data))
# Fix the Brand.Code hidden NAs (Now it will find it!)
test_data <- test_data %>%
mutate(Brand.Code = ifelse(Brand.Code == "", NA, as.character(Brand.Code))) %>%
mutate(Brand.Code = as.factor(Brand.Code))
# Drop the exact same redundant variables we dropped from training
test_reduced <- test_data %>% select(-all_of(high_cor_vars))
# Impute missing values using missForest
set.seed(42)
test_imputation <- missForest(test_reduced, maxiter = 5, ntree = 100, verbose = FALSE)
test_clean <- test_imputation$ximp
# Center and Scale using the TRAINING set's math (scale_model)
test_final <- predict(scale_model, test_clean)
# Here we take Random Forest model, feeds it the test_final dataset
# Random Forest to predict PH for the blind Test Set
test_predictions <- predict(rf_model, newdata = test_final)
submission_file <- data.frame(
Record_ID = 1:length(test_predictions),
Predicted_PH = test_predictions
)
#save_path <- "C:/Users/yu_we/Downloads/ABC_Beverage_Final_Predictions.csv"
#write.csv(submission_file, save_path, row.names = FALSE)
hist_plot <- ggplot(submission_file, aes(x = Predicted_PH)) +
geom_histogram(fill = "#b19cd9", color = "white", bins = 20) +
theme_minimal() +
labs(title = "What the Factory will produce today (Predicted PH)",
x = "Predicted PH Level",
y = "Number of Batches")
print(hist_plot)
train_ph <- data.frame(PH_Value = train_set_final$PH, Data_Type = "1. Historical (Training)")
test_ph <- data.frame(PH_Value = submission_file$Predicted_PH, Data_Type = "2. Predicted (Test)")
combined_ph <- rbind(train_ph, test_ph)
overlap_plot <- ggplot(combined_ph, aes(x = PH_Value, fill = Data_Type)) +
geom_density(alpha = 0.6) +
theme_minimal() +
scale_fill_manual(values = c("1. Historical (Training)" = "darkgray",
"2. Predicted (Test)" = "darkorange")) +
labs(title = "Prediction v. Reality",
x = "PH Value",
y = "Density",
fill = "Data Source")
print(overlap_plot)