Overview & Introduction

Objective:

The purpose of this analysis is to understand the beverage manufacturing process at the ABC Beverage company. We do this through an extensive evaluation of pH levels, consisting of preliminary data analysis and careful model selection for predicting pH in the manufacturing process.

Introduction:

We began this analysis by identifying which variables would present issues for the model. There was a glaring problem with the state of the original beverage data: the Brand Code and MFR variables were missing roughly 4.7% and 8.2% of their values, respectively. Following the discovery of this missingness, the group moved to further evaluate the data through thorough exploratory data analysis (EDA). The EDA confirmed that the dataset contained missing values, skewness, and outliers, with key variables like Brand Code and MFR requiring special attention in the steps prior to modeling.

After cleaning the data we shifted our attention to building models and evaluating performance. The group settled on five models: Random Forest, Gradient Boosting, Neural Networks, Support Vector Machines, and MARS, chosen to cover a range of techniques rather than one area of predictive modeling. Finally, we selected the single model that was more performant than all of the others.

Key Findings:

Due to its high performance relative to the other four models, the group chose random forest as the best fit. While evaluating the top 10 predictors in our random forest model, the mnf_flow variable stood out the most: with an importance score near 45, well ahead of any other predictor, mnf_flow is undoubtedly the most important variable. In consultation with the correlation matrix in the EDA section, we can deduce that mnf_flow is not only important but also negatively correlated with pH levels in beverage production. The model behind this discovery, a random forest, boasts a test R-squared of 71.3%, a test RMSE of 0.096, and a test MAE of 0.069, highlighting its accuracy and reliability in predicting pH levels.

Background:

pH is a critical quality metric in beverage production because it directly affects the flavor, safety, shelf life, and consistency of the product. Maintaining the correct pH level ensures that the beverage is resistant to microbial growth and chemical deterioration, which is essential for consumer safety and product stability. Regulatory requirements mandate that beverage manufacturers closely monitor and control pH levels to meet industry standards and ensure product quality. Non-compliance with these regulations can lead to product recalls, legal issues, and damage to the company’s reputation. The FDA publishes guidance and best practices on this topic.

1. Data Exploration

1.1 Load data

Data for training and evaluation are obtained from the GitHub repo via URLs, ensuring that the most recent version is used for analysis. Temporary files serve as intermediate storage during the download step.

1.2 Summary Statistics

Beverage dataset: 2571 rows x 33 columns.
Evaluation dataset: 267 rows x 33 columns.
Mostly numeric, with one categorical variable (Brand Code).

  • NAs:
    Brand Code: ~4.7% missing in beverage_df, ~3.0% missing in eval_df.
    Many numeric features have a small percentage of missing values, generally < 2%.
    PH (the target variable in beverage_df): 4 NAs, all for Brand Code B. In eval_df, PH is entirely missing (it is the column to be predicted).
    Impute missing values for numeric features using the median or mean (or more advanced imputation if correlated features exist). Handle missing values within a robust pipeline to avoid information loss.

  • Skewness and outliers
    Variables like Mnf Flow have extreme values (mean is heavily influenced by outliers). Range: -100.2 to 229.4.
    PH: Check for extreme pH values that may affect model accuracy.
    Hyd Pressure1–4: High standard deviations (e.g., Hyd Pressure2 with a mean of ~21 and SD of 16.4).
    Analyze the distribution of these variables using histograms or boxplots. Consider winsorization or Box-Cox/log transformations for skewed distributions.

  • Feature Importance for PH prediction
    Carb Volume, Fill Ounces, PC Volume, Carb Pressure, and Carb Temp have small standard deviations and are likely controlled manufacturing parameters. These might directly influence pH.
    Brand Code can be treated as a categorical predictor for brand-specific variations.
    Use correlation or feature-importance analysis to identify which variables most strongly influence PH.

  • Brand Code: 4 levels (A, B, C, D).
    Unbalanced distribution: B accounts for ~50% of records, while A, C, and D are much smaller. The imbalance might affect models like decision trees or ensemble methods.
    Apply stratified sampling or weighting to handle imbalance during training. Explore interaction effects between Brand Code and numeric variables.

  • Multicollinearity
    Variables such as Carb Volume, PC Volume, and Carb Pressure might be correlated due to their role in carbonation.
    Multiple pressure-related variables (Carb Pressure1, Fill Pressure, etc.) and filler speed/level metrics could also have collinear relationships.
    Compute a correlation matrix to detect highly correlated predictors.

  • Data Leakage
    Variables like PSC, PSC Fill, and PSC CO2 could be downstream measures dependent on pH. Confirm whether these are part of the production process or outcome metrics.
    Analyze the production process to ensure no data leakage into the model.

  • eval_df
    All PH values are missing for prediction.
    Structure and missingness are similar to beverage_df. Ensure preprocessing and feature engineering pipelines are consistent between training and evaluation datasets.

1.3 EDA

EDA Takeaways:

The pH of a solution is measured on a scale from 0 to 14, indicating its acidity or alkalinity. In beverage manufacturing, maintaining the correct pH level is crucial for ensuring flavor, safety, shelf life, and uniformity.

The data shows that the most common pH level is around 8.5, which appears to be the target for many batches of beverages. For carbonated beverages, a slightly alkaline pH level can:

  • Improve Taste: Balance the tang of carbonation with a smoother, less acidic flavor.
  • Aid in Shelf Stability: Ensure the beverage is resistant to microbial growth and chemical deterioration.
  • Reflect Process Consistency: Indicate consistent raw material quality, carbonation levels, and filling precision.

The distribution of pH is roughly normal with a slight negative skew, centered around 8.5, and includes some outliers at both the lower and upper tails.

Brand B has significantly more entries compared to other brands, suggesting the need for balancing or stratified sampling during model training. The pH distribution varies across Brand Codes, with Brand C having the lowest median pH and Brand B the highest. Outliers are present in all brand codes, and it is important to investigate whether these outliers are due to measurement errors or valid extreme cases.

EDA - Distributions:

Several variables in the dataset exhibit notable characteristics and potential issues:

  • Outliers: MFR, Fill Speed, Hyd Pressure3, and Pressure Setpoint contain outliers that may need to be addressed.
  • Bimodal Distribution: Variables such as Alch Rel, Balling, and Balling Lvl show bimodal distributions, suggesting that binning might be necessary.
  • Long-Tailed Distribution: Mnf Flow has a long-tailed distribution, which could affect model performance.
  • Sharp Peaks: Filler Speed, Hyd Pressure2, and Usage Cont exhibit sharp peaks in their distributions.
  • Spread and Skewness: Hyd Pressure4 shows a wide spread, reflecting process variability. MFR and Oxygen Filler are heavily right-skewed, indicating the need for normalization.

For variables like MFR and Usage Cont, applying log or Box-Cox transformations can help normalize their distributions.
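As a minimal illustration (not the exact pipeline code), a simple log transform can be compared with a Box-Cox transform whose lambda is estimated by caret; the raw, pre-cleaning column name MFR is assumed:

#Compare a log transform with an estimated Box-Cox transform for a skewed variable
x <- na.omit(beverage_df[["MFR"]])     #drop NAs for the illustration
log_x <- log(x)                        #simple log transform
bc <- BoxCoxTrans(x)                   #caret: estimate lambda from the data
bc_x <- predict(bc, x)                 #apply the fitted Box-Cox transform
#Skewness before and after (e1071)
e1071::skewness(x); e1071::skewness(log_x); e1071::skewness(bc_x)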

Mnf Flow: a value of -100 appears frequently, which could indicate:

  • Missing Data: A common placeholder value.
  • Domain-Specific Encoding: For example, representing “no flow” or a specific state in manufacturing.
  • Data Error: A mistake in data collection or recording.

It is essential to confirm whether -100 is a legitimate value or a placeholder. If it is a placeholder, it should be treated as missing and imputed or removed to ensure data integrity.
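A quick, hedged check mirroring the recode applied later in the preparation pipeline (the raw column name Mnf Flow is assumed):

#How often the flag value occurs
table(beverage_df[["Mnf Flow"]] == -100)
#If confirmed to be a placeholder, treat it as missing before imputation
mnf <- beverage_df[["Mnf Flow"]]
mnf[mnf == -100] <- NA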

Initial Missingness:

The dataset contains varying levels of missing values across different variables:

  • MFR: This variable has the highest percentage of missing values, at around 8%. It may be challenging to collect consistently or might not be relevant across all rows. If MFR is critical, consider using mean, median, or model-based imputation. If it does not strongly correlate with the target variable or other features, it might be best to drop it.
  • Variables with <3% Missingness: For variables like PSC CO2 and PC Volume, which have less than 3% missing values, simple imputation techniques can be applied.
  • Brand Code: Mode imputation is recommended for handling missing values in the Brand Code variable (a minimal sketch of these simple imputations follows this list).
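A minimal sketch of the simple imputations described above, for illustration only (the final pipeline instead uses KNN and random-forest imputation; raw column names are assumed):

#Median imputation for a numeric column with little missingness
impute_median <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }
beverage_df[["PC Volume"]] <- impute_median(beverage_df[["PC Volume"]])

#Mode imputation for the categorical Brand Code
mode_val <- names(which.max(table(beverage_df[["Brand Code"]])))
beverage_df[["Brand Code"]][is.na(beverage_df[["Brand Code"]])] <- mode_val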

1.4 Correlation Matrix

Preliminary Findings:

The data revealed several important correlations and potential multicollinearity issues. We used this correlation matrix to begin to understand the inner workings of the production process. Below is a write-up of these initial findings.

  • Intercorrelations: Density, Balling, and Carb Rel show strong intercorrelations, which could lead to multicollinearity issues in linear regression models.
  • Negative Correlations: Mnf Flow and Usage Cont are strongly negatively correlated with pH, suggesting the need for additional preprocessing or transformation before moving to model selection.
  • Interaction Terms: Exploring interaction terms between variables like Bowl Setpoint and Filler Level or between Pressure Vacuum and Carb Flow could provide further insights into their relationships with pH.

Key Variable Correlations:

  • Bowl Setpoint (correlation: 0.36): This parameter likely affects the flow or intensity of ingredient mixing during carbonation or flavoring stages. A higher bowl setpoint may result in more uniform mixing, reducing pH variability.
  • Filler Level (correlation: 0.35): Ensures containers are properly filled to prevent extra air or gas imbalances. Variations in filler level can influence carbonation levels and pH, so proper calibration is essential.
  • Pressure Vacuum (correlation: 0.22): Assists with gas exchange during filling. Variations in vacuum pressure can cause inconsistent carbonation, affecting pH. Monitoring this parameter ensures the desired carbonation “bite.”
  • Oxygen Filler (correlation: 0.17): Determines the amount of oxygen introduced during filling. Higher oxygen levels can accelerate oxidation, reducing pH and compromising flavor stability. Maintaining low oxygen levels is crucial for product quality.
  • Mnf Flow (correlation: -0.45): Describes the rate at which materials flow during production. A higher flow rate may result in irregular mixing, reduced carbonation, and a shift in pH. Adjusting flow rates helps maintain consistent product attributes.

Table 6: Correlation with PH
Variable Coefficient
PH 1.0000000
Bowl Setpoint 0.3615875
Filler Level 0.3520440
Carb Flow 0.2335937
Pressure Vacuum 0.2197355
Carb Rel 0.1960515
Alch Rel 0.1666822
Oxygen Filler 0.1644854
Balling Lvl 0.1093712
PC Volume 0.0988667
Density 0.0955469
Balling 0.0767002
Carb Pressure 0.0762134
Carb Volume 0.0721325
Carb Temp 0.0322794
Air Pressurer -0.0079972
PSC Fill -0.0238098
Filler Speed -0.0408830
MFR -0.0451965
Hyd Pressure1 -0.0470664
PSC -0.0698730
PSC CO2 -0.0852599
Fill Ounces -0.1183359
Carb Pressure1 -0.1187642
Hyd Pressure4 -0.1714340
Temperature -0.1826596
Hyd Pressure2 -0.2226600
Hyd Pressure3 -0.2681018
Pressure Setpoint -0.3116639
Fill Pressure -0.3165145
Usage cont -0.3576120
Mnf Flow -0.4592313

2. Data Preparation

Identification of problem areas

  • Standardizing Column Names:
    • Using make_clean_names() to eliminate issues caused by case sensitivity or spaces, making data manipulation easier and less error-prone.
  • Near-Zero-Variance Predictors:
    • Identifying and removing near-zero-variance predictors (e.g., hyd_pressure1) to simplify the modeling process, boost computational efficiency, and potentially increase model performance.
  • Missing Value Imputation:
    • K-Nearest Neighbors (KNN): Fills in missing values using the average (or median) of the ‘k’ most similar records; straightforward and effective when relationships among features are simple.
    • Random Forest Imputation: Captures non-linear correlations by learning decision rules from other features, providing more accurate approximations than simpler methods like mode imputation.
  • Handling Mnf Flow Values:
    • If ‘-100’ is not a valid data point, mark it as missing and use KNN imputation to ensure replacement values reflect typical operational data, preserving consistency in the manufacturing flow analysis.
  • Outliers:
    • Initially not adjusted to retain important information about the system’s behavior under unusual conditions, which can be crucial for predictive modeling.
  • One-Hot Encoding:
    • Converts categorical variables, such as Brand Code, into binary variables, making the model more adaptable in learning patterns unique to each category.
  • Interaction Terms and Polynomial Features:
    • Interaction Terms: Model the effects of interactions between variables on pH levels (e.g., pc_volume_pressure).
    • Polynomial Features: Capture non-linear relationships (e.g., carb_volume2).
  • Binning Continuous Variables:
    • Helps manage non-linear effects and improve model performance by transforming continuous variables into categorical ones (e.g., carb_temp_binned); a condensed sketch of the encoding, interaction, and binning steps follows this list.
  • Transformations:
    • Log and Square Root Transformations: Applied to variables like filler_speed and air_pressurer to standardize their distributions and reduce skewness.
    • Transformations help reduce variance and make data more normally distributed, enhancing the effectiveness of many statistical learning approaches that rely on normally distributed data.
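A condensed sketch of the encoding, interaction, and binning steps above, abridged from the full pipeline in the appendix (cleaned column names assumed):

#One-hot encode brand_code (fullRank avoids the dummy-variable trap)
dummy_vars <- dummyVars(~ brand_code, data = beverage_imputed, fullRank = TRUE)
beverage_imputed <- cbind(beverage_imputed, predict(dummy_vars, newdata = beverage_imputed))

#Interaction and polynomial features
beverage_revised <- beverage_imputed %>%
  mutate(
    pc_volume_pressure = pc_volume * carb_pressure,  #interaction feature
    carb_volume2 = carb_volume^2                     #polynomial feature
  )

#Quartile-bin carb_temp
beverage_revised$carb_temp_binned <- cut(beverage_revised$carb_temp,
    breaks = quantile(beverage_revised$carb_temp, probs = seq(0, 1, 0.25)),
    include.lowest = TRUE, labels = FALSE)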

2.1 Missing values

  • Cleaning up the naming conventions in both the evaluation dataframe and our original dataset.
##  [1] "brand_code"        "carb_volume"       "fill_ounces"      
##  [4] "pc_volume"         "carb_pressure"     "carb_temp"        
##  [7] "psc"               "psc_fill"          "psc_co2"          
## [10] "mnf_flow"          "carb_pressure1"    "fill_pressure"    
## [13] "hyd_pressure1"     "hyd_pressure2"     "hyd_pressure3"    
## [16] "hyd_pressure4"     "filler_level"      "filler_speed"     
## [19] "temperature"       "usage_cont"        "carb_flow"        
## [22] "density"           "mfr"               "balling"          
## [25] "pressure_vacuum"   "ph"                "oxygen_filler"    
## [28] "bowl_setpoint"     "pressure_setpoint" "air_pressurer"    
## [31] "alch_rel"          "carb_rel"          "balling_lvl"
##               freqRatio percentUnique zeroVar  nzv
## hyd_pressure1  31.11111      9.529366   FALSE TRUE

To impute missing values, knnImputation was used for the numeric variables (with mnf_flow’s -100 placeholder recoded to NA before KNN imputation), and random forest imputation was used for brand_code.

##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##    carb_pressure1     fill_pressure     hyd_pressure2     hyd_pressure3 
##                 0                 0                 0                 0 
##     hyd_pressure4      filler_level      filler_speed       temperature 
##                 0                 0                 0                 0 
##        usage_cont         carb_flow           density               mfr 
##                 0                 0                 0                 0 
##           balling   pressure_vacuum                ph     oxygen_filler 
##                 0                 0                 0                 0 
##     bowl_setpoint pressure_setpoint     air_pressurer          alch_rel 
##                 0                 0                 0                 0 
##          carb_rel       balling_lvl          mnf_flow        brand_code 
##                 0                 0                 0               120
##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##          mnf_flow    carb_pressure1     fill_pressure     hyd_pressure2 
##                 0                 0                 0                 0 
##     hyd_pressure3     hyd_pressure4      filler_level      filler_speed 
##                 0                 0                 0                 0 
##       temperature        usage_cont         carb_flow           density 
##                 0                 0                 0                 0 
##               mfr           balling   pressure_vacuum     oxygen_filler 
##                 0                 0                 0                 0 
##     bowl_setpoint pressure_setpoint     air_pressurer          alch_rel 
##                 0                 0                 0                 0 
##          carb_rel       balling_lvl        brand_code                ph 
##                 0                 0                 8               267
##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##    carb_pressure1     fill_pressure     hyd_pressure2     hyd_pressure3 
##                 0                 0                 0                 0 
##     hyd_pressure4      filler_level      filler_speed       temperature 
##                 0                 0                 0                 0 
##        usage_cont         carb_flow           density               mfr 
##                 0                 0                 0                 0 
##           balling   pressure_vacuum                ph     oxygen_filler 
##                 0                 0                 0                 0 
##     bowl_setpoint pressure_setpoint     air_pressurer          alch_rel 
##                 0                 0                 0                 0 
##          carb_rel       balling_lvl          mnf_flow        brand_code 
##                 0                 0                 0                 0
##       carb_volume       fill_ounces         pc_volume     carb_pressure 
##                 0                 0                 0                 0 
##         carb_temp               psc          psc_fill           psc_co2 
##                 0                 0                 0                 0 
##          mnf_flow    carb_pressure1     fill_pressure     hyd_pressure2 
##                 0                 0                 0                 0 
##     hyd_pressure3     hyd_pressure4      filler_level      filler_speed 
##                 0                 0                 0                 0 
##       temperature        usage_cont         carb_flow           density 
##                 0                 0                 0                 0 
##               mfr           balling   pressure_vacuum     oxygen_filler 
##                 0                 0                 0                 0 
##     bowl_setpoint pressure_setpoint     air_pressurer          alch_rel 
##                 0                 0                 0                 0 
##          carb_rel       balling_lvl        brand_code                ph 
##                 0                 0                 0               267

2.2 New features

One-hot encoding and interaction terms

Comparing newly formed and transformed variables

The team created six new transformations of six original variables in the dataset to try to improve performance. The changes were applied to both the original dataset and the eval dataset.

Transformed variables and the transformations applied:
  • filler_speed (log)
  • mfr (log)
  • temperature (sqrt)
  • air_pressurer (sqrt)
  • oxygen_filler (shifted log)
  • psc_co2 (shifted log)

Tables 7 and 8 were created to show these changes in both transformed data frames.

3. Build Models

Splitting the data into training and testing sets (via createDataPartition()) allows us to test the model on previously unseen data and estimate how well it would perform in real-world circumstances. Stratified sampling on the outcome ensures that the distribution of pH remains consistent between the training and testing sets.
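Condensed from the appendix, the split is stratified on the outcome ph:

#80/20 split, stratified on ph
set.seed(123)
trainIndex <- createDataPartition(beverage_transformed$ph, p = 0.8, list = FALSE)
train_transformed <- beverage_transformed[trainIndex, ]
test_transformed  <- beverage_transformed[-trainIndex, ]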

The use of numerous models (Random Forest, Gradient Boosting, Neural Networks, Support Vector Machine, and MARS) provides a comprehensive view of how various techniques perform with the provided data.

  • MARS: A flexible non-parametric technique capable of effectively modeling nonlinearities and interactions.
  • Support Vector Machine: Effective for finding an optimal decision boundary; best suited for boundary decisions rather than probability estimation.
  • Neural Networks: Ideal for capturing complex patterns in large datasets, but may demand significant computational resources.
  • Gradient Boosting: Effective for handling a variety of data types; gradually improves performance through boosting.
  • Random Forest: Effective for handling non-linear relationships and resistant to overfitting.

Cross-validation (specified in trainControl()) ensures that the model is rigorously validated and not tailored to a single subset of the data. Repeated cross-validation helps to reduce variance in the model evaluation process, resulting in a more accurate performance estimate. Using grids (expand.grid()) for distinct model parameters enables systematic searching through a set of alternatives to discover the ideal combination that maximizes model performance.
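Condensed from the appendix: the resampling scheme shared by all models and, as an example, the MARS tuning grid:

#Repeated 10-fold cross-validation
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, allowParallel = TRUE)
#Example grid searched exhaustively by train()
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)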

MARS

MARS is a versatile modeling method that effectively captures nonlinearities and variable interactions. Preparation entails scaling and centering the data to standardize feature scales, so the model can focus on discovering significant interactions and trends without being influenced by differing variable scales.
Trained using the earth method with a parameter grid specifying the degree and the number of terms retained after pruning (nprune). Centering and scaling were applied to normalize the variables. MARS builds models from piecewise linear splines, which are especially useful for the complex, non-linear interactions that may be present in data from various manufacturing processes. To avoid overfitting, the model starts with the maximum number of basis functions and prunes them using the generalized cross-validation criterion.

The best model, with degree = 2 and nprune = 20, showed a cross-validated RMSE of 0.1269, R-squared of 0.459, and MAE of 0.0960, indicating moderate predictive accuracy. mnf_flow was the most influential predictor, followed by variables like brand_code.C and hyd_pressure3, suggesting these factors are critical in predicting pH levels.

##    degree nprune      RMSE Rsquared       MAE     RMSESD RsquaredSD      MAESD
## 38      2     20 0.1269369 0.459168 0.0960094 0.01134229 0.07564645 0.00693096
##       RMSE   Rsquared        MAE 
## 0.12208106 0.51031566 0.09219265
Variable importance

## earth variable importance
## 
##                 Overall
## mnf_flow         100.00
## brand_code.C      72.84
## hyd_pressure3     70.05
## alch_rel          64.81
## bowl_setpoint     60.48
## pressure_vacuum   57.10
## air_pressurer     49.35
## balling           42.60
## density           39.93
## temperature       30.51
## usage_cont        22.44
## carb_pressure1    18.34

Support Vector Machine

SVM data preparation involves scaling and centering, which are critical for this model because it fits a decision function in a high-dimensional feature space. The scale of the data influences that function, and unscaled data can mislead the learning algorithm by disproportionately emphasizing larger-scale features.

The SVM is constructed by finding the function that fits the data with the greatest margin; in the regression setting used here, predictions are fit within an epsilon-insensitive tube. It is resistant to outliers and effective in high-dimensional spaces. Kernel functions, such as the radial basis function used here, enable SVM to learn non-linear relationships, making it suitable for a variety of data patterns.
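A minimal sketch of the SVM fit, mirroring the setup used for the other models; grid values other than the selected sigma = 0.01 and C = 10 are assumptions, since the exact grid is not shown:

set.seed(123)
svmGrid <- expand.grid(sigma = c(0.001, 0.01, 0.1), C = c(1, 10, 100))
svmModel <- train(ph ~ ., data = train_transformed,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneGrid = svmGrid,
                  trControl = control)
#Held-out performance
svmPred <- predict(svmModel, newdata = test_transformed)
postResample(pred = svmPred, obs = test_transformed$ph)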

Achieved a cross-validated RMSE of 0.1194, R-squared of 0.5236, and MAE of 0.0877, showing slightly better performance than MARS.

Illustrated less reliance on any single variable, with mnf_flow still significant but less dominant, showcasing SVM’s distribution of importance across multiple features.

##   sigma  C      RMSE  Rsquared        MAE      RMSESD RsquaredSD       MAESD
## 7  0.01 10 0.1193689 0.5235771 0.08771225 0.008812062 0.05431116 0.004872963
##       RMSE   Rsquared        MAE 
## 0.11787203 0.54495184 0.08445738

Variable importance


## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##                    Overall
## mnf_flow          0.229134
## usage_cont        0.149547
## bowl_setpoint     0.114516
## filler_level      0.101887
## pressure_setpoint 0.097533
## carb_flow         0.086556
## brand_code.C      0.078800
## hyd_pressure3     0.053540
## pressure_vacuum   0.048704
## fill_pressure     0.043607
## hyd_pressure2     0.034975
## brand_code.D      0.034145
## oxygen_filler     0.032301
## carb_rel          0.024706
## hyd_pressure4     0.024241
## temperature       0.024065
## alch_rel          0.019814
## brand_code.B      0.012828
## psc               0.009141
## fill_ounces       0.008995

Neural networks

Data standardization is required for neural networks to learn efficiently. Scaling and centering reduce the dominance of higher magnitude features and improve gradient descent optimization by guaranteeing more uniform scaling across all inputs.

Neural networks are made up of layers of interconnected nodes, with each connection carrying a weight; the nodes apply activation functions to their inputs. Training entails adjusting these weights based on the difference between predicted and actual results. The model used here, avNNet, is an ensemble of neural networks whose averaged predictions are more stable than any single network’s.
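A minimal sketch of the avNNet fit; grid values other than the selected decay = 0, size = 10, and bag = FALSE are assumptions:

set.seed(123)
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1), size = c(5, 10), bag = FALSE)
nnetModel <- train(ph ~ ., data = train_transformed,
                   method = "avNNet",
                   preProc = c("center", "scale"),
                   tuneGrid = nnetGrid,
                   linout = TRUE,   #linear output for regression
                   trace = FALSE, maxit = 500,
                   trControl = control)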

Noted for the best cross-validated R-squared among the models so far (0.5586, ahead of MARS and SVM), with an RMSE of 0.1145 and MAE of 0.0856, suggesting a strong fit.

Like SVM, it shows balanced importance across several predictors. Note that caret falls back to a model-free importance measure (loess R-squared) for both the SVM and avNNet fits, which is why the two importance tables are identical.

##    decay size   bag      RMSE  Rsquared       MAE     RMSESD RsquaredSD
## 10     0   10 FALSE 0.1145209 0.5585798 0.0855949 0.01006998 0.06574056
##          MAESD
## 10 0.005049342
##       RMSE   Rsquared        MAE 
## 0.11073852 0.59788239 0.08266435

Variable importance


## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##                    Overall
## mnf_flow          0.229134
## usage_cont        0.149547
## bowl_setpoint     0.114516
## filler_level      0.101887
## pressure_setpoint 0.097533
## carb_flow         0.086556
## brand_code.C      0.078800
## hyd_pressure3     0.053540
## pressure_vacuum   0.048704
## fill_pressure     0.043607
## hyd_pressure2     0.034975
## brand_code.D      0.034145
## oxygen_filler     0.032301
## carb_rel          0.024706
## hyd_pressure4     0.024241
## temperature       0.024065
## alch_rel          0.019814
## brand_code.B      0.012828
## psc               0.009141
## fill_ounces       0.008995

Gradient Boosting

Scaling is not required for GBM preparation because the model is based on scale-invariant decision trees. However, centering and scaling are used here to ensure consistency in preprocessing across models, which aids in comparative model evaluation.

GBM combines an ensemble of weak prediction models, typically decision trees, into a strong learner through a sequential process: each tree is built to correct the mistakes of earlier trees. The learning process and model complexity are controlled by parameters such as the number of trees, tree depth, and learning rate.

Tuned over a range of tree numbers, interaction depths, and minimum observations in nodes, using a shrinkage factor to regulate learning.
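A minimal sketch of the GBM fit, assuming the feature-engineered training set train_revised (its engineered features appear in the importance table below); grid values other than the selected combination are assumptions:

set.seed(123)
gbmGrid <- expand.grid(shrinkage = c(0.01, 0.1),
                       interaction.depth = c(3, 5, 7),
                       n.minobsinnode = 10,
                       n.trees = c(500, 1000))
gbmModel <- train(ph ~ ., data = train_revised,
                  method = "gbm",
                  preProc = c("center", "scale"),
                  tuneGrid = gbmGrid,
                  verbose = FALSE,
                  trControl = control)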

Exhibited the best performance of the models so far, with a cross-validated RMSE of 0.1102, R-squared of 0.5956, and MAE of 0.0830.

mnf_flow and brand_code.C were significant, along with other operational parameters, indicating their pivotal roles in pH level determination.

##     shrinkage interaction.depth n.minobsinnode n.trees      RMSE  Rsquared
## 110      0.01                 7             10    1000 0.1101819 0.5955535
##            MAE      RMSESD RsquaredSD       MAESD
## 110 0.08304377 0.009441458 0.06105948 0.005108199
##       RMSE   Rsquared        MAE 
## 0.10548388 0.64448254 0.07928038

Variable importance


## gbm variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##                    Overall
## mnf_flow            328.40
## brand_code.C        101.37
## alch_rel             94.79
## oxygen_filler        87.81
## usage_cont           87.32
## temperature          75.56
## carb_pressure1       59.25
## pressure_vacuum      56.63
## carb_rel             55.83
## setpoint_diff        54.23
## balling_lvl          47.54
## air_pressurer        46.17
## carb_flow            45.93
## filler_speed         45.79
## balling              43.74
## hyd_pressure3        37.20
## density              35.93
## mfr                  34.59
## hyd_pressure2        25.87
## pc_volume_pressure   25.66

Random Forest

Random Forest models, like GBM, are tree-based and do not require data scaling. The data, however, is scaled and centered to ensure that the preparation methods are consistent across all models used in the research.

During training, Random Forest grows numerous decision trees and outputs the mode of the classes (classification) or the mean prediction (regression) across trees. It is effective for both classification and regression, and its ensemble averaging makes it resistant to overfitting.
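A minimal sketch of the Random Forest fit, again assuming train_revised; mtry candidates other than the selected mtry = 10 are assumptions:

set.seed(123)
rfGrid <- expand.grid(mtry = c(5, 10, 15))
rfModel <- train(ph ~ ., data = train_revised,
                 method = "rf",
                 preProc = c("center", "scale"),
                 tuneGrid = rfGrid,
                 importance = TRUE,   #store permutation importance
                 trControl = control)
#Held-out performance
rfPred <- predict(rfModel, newdata = test_revised)
postResample(pred = rfPred, obs = test_revised$ph)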

Delivered the best test metrics overall: the highest R-squared (0.7130), lowest RMSE (0.0963), and lowest MAE (0.0699), indicating excellent generalization.

Demonstrates significant contributions from operational variables like mnf_flow and oxygen_filler, aligning with expected impacts on pH.

##   mtry      RMSE  Rsquared        MAE     RMSESD RsquaredSD       MAESD
## 5   10 0.1007928 0.6741332 0.07271081 0.00872876 0.05015008 0.003999654
##       RMSE   Rsquared        MAE 
## 0.09630017 0.71303966 0.06989033

Variable importance


## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##                 Overall
## mnf_flow          44.54
## oxygen_filler     34.35
## pressure_vacuum   31.16
## carb_rel          29.64
## air_pressurer     29.19
## brand_code.C      28.84
## temperature       28.52
## filler_speed      28.06
## balling_lvl       28.02
## alch_rel          24.77
## usage_cont        24.45
## density           23.91
## balling           23.79
## carb_flow         22.90
## setpoint_diff     21.78
## carb_volume2      21.00
## carb_pressure1    20.62
## hyd_pressure3     18.45
## filler_level      17.95
## mfr               17.05

4. Model Selection

Resampling metrics (RMSE, R², and MAE) from model training are compared to test metrics to evaluate overfitting and generalization. Models are judged on their predictive accuracy, robustness, and computational efficiency. The final decision (Random Forest, with the best test metrics) is based on a combination of these criteria, with the goal of producing a model that performs well in a variety of circumstances.

The chosen model should meet operational needs such as ease of deployment, stakeholder interpretability, and long-term maintenance. Model performance should be checked over time using new data to verify that it continues to operate as expected under changing settings.

Based on the resampling (cross-validation) metrics and test metrics for each model, we can perform a detailed analysis to select the best model for predicting pH values.
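A hedged sketch of a side-by-side comparison using caret’s resamples(), assuming the five fitted model objects are named as below:

#Collect cross-validation results from all models (same seed and folds)
results <- resamples(list(MARS = marsModel, SVM = svmModel, NNet = nnetModel,
                          GBM = gbmModel, RF = rfModel))
summary(results)                  #resampling distributions of RMSE, R-squared, MAE
bwplot(results, metric = "RMSE")  #visual comparison across models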

Random Forest (RF) outperforms the other models on practically all metrics, especially on the test set, where it has the lowest RMSE (0.0963) and the highest R-squared (0.7130). Its MAE is likewise the lowest among the models, at 0.0699, showing that RF predictions are, on average, closest to the real values.
Gradient Boosting Machine (GBM) follows closely behind RF, with competitive scores, particularly in test R-squared (0.6445), indicating its resilience in dealing with the dataset’s variation.
Neural Network (NNet) and Support Vector Machine (SVM) both perform well, with NNet marginally ahead on both resampling and test metrics, making it a viable alternative if ensemble approaches are eschewed.

While Multivariate Adaptive Regression Splines (MARS) is useful for capturing nonlinear interactions, its performance metrics fall behind those of the other models.

Because of their ensemble nature and iterative improvements, RF and GBM models may require careful tuning to minimize overfitting. This category also includes NNet, with its layered structure and capacity for deep interactions. Compared to ensemble and neural network models, MARS and SVM are easier to interpret; if model explanation is important, these may be favored despite inferior performance. RF and GBM also require more training time and resources than simpler models such as MARS or SVM.

Given the findings, Random Forest is suggested for deployment due to its superior performance on all important measures. If there are limits on model transparency or processing resources, simpler models such as SVM, or even NNet, may be considered as alternatives.

Further steps should include a rigorous validation process to confirm the RF model’s stability, an investigation into feature engineering based on RF variable importance, and potential ensembling procedures that combine strengths from several models to improve prediction accuracy.

Model   Resample RMSE   Resample R²   Resample MAE   Test RMSE   Test R²   Test MAE
MARS 0.1269369 0.4591680 0.0960094 0.1220811 0.5103157 0.0921927
SVM 0.1193689 0.5235771 0.0877122 0.1178720 0.5449518 0.0844574
NNet 0.1145209 0.5585798 0.0855949 0.1107385 0.5978824 0.0826644
GBM 0.1101819 0.5955535 0.0830438 0.1054839 0.6444825 0.0792804
RF 0.1007928 0.6741332 0.0727108 0.0963002 0.7130397 0.0698903

Save results to an Excel file
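A minimal sketch, assuming the final predictions come from the Random Forest model; the output file name is illustrative:

#Predict pH for the evaluation set and write to Excel (writexl)
eval_predictions <- predict(rfModel, newdata = eval_revised)
write_xlsx(data.frame(ph_prediction = eval_predictions), "ph_predictions.xlsx")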

5. Conclusion

Structure of the Non-Technical Report

Executive Summary
Objective: Briefly describe the purpose of the analysis, which is to understand and predict pH levels in the manufacturing process as per new regulations.
Key Findings: Summarize the main outcomes, such as the most important factors affecting pH and the performance of the predictive model.

Introduction
Background: Provide context about why pH is a critical quality metric in beverage production and the impact of regulatory requirements.
Project Scope: Outline what the project covers, including data sources, main analytical goals, and the expected impact on the production process.

Data Overview
Data Sources: Describe where the data comes from and what it represents (production parameters, quality measures, etc.).
Data Quality: Highlight the cleanliness of the data, any challenges with missing values or outliers, and how these were addressed.

Insights from Data Exploration
Descriptive Statistics: Discuss general trends observed in the data, such as typical pH levels and variability across different brand codes.
Visuals and Interpretations: Use simple graphs like bar charts or line graphs to show important trends and patterns that inform about the pH level. For instance, show how pH varies by different manufacturing parameters without getting into statistical jargon.

Predictive Modeling
Approach: Explain that various models were tested to find the best way to predict pH levels, focusing on the usefulness of the models in making predictions that help comply with regulations.
Model Selection: Discuss why the chosen model (e.g., Random Forest) was suitable, focusing on its ability to provide accurate and reliable predictions rather than the technical details.

Implications and Recommendations
Operational Impact: Discuss how the model can be used in daily operations to ensure compliance and optimize production processes.
Future Steps: Suggest how the analysis could be extended or improved, such as collecting more data over time or integrating real-time data analytics into the production line.

Conclusion
Summary of Benefits: Reiterate the benefits of the predictive modeling project, emphasizing improved regulatory compliance and enhanced production efficiency.
Call to Action: Encourage stakeholders to consider the recommendations and support further steps to integrate analytics into production practices.

Appendix: R Code

## ----setup, message=FALSE, warning=FALSE, include=FALSE-------------------------------------------------------------------------------------------------------------------
#chunks
knitr::opts_chunk$set(eval=TRUE, message=FALSE, warning=FALSE, echo=FALSE, fig.height=5, fig.align='center')

#libraries
library(tidyverse)
library(DMwR)
library(xgboost)
library(vip)
library(summarytools)
library(fpp3)
library(readxl)
library(curl)
library(latex2exp)
library(seasonal)
library(GGally)
library(gridExtra)
library(doParallel)
library(reshape2)
library(Hmisc)
library(corrplot)
library(e1071)
library(caret)
library(VIM)
library(rpart)
library(forecast)
library(urca)
library(earth)
library(glmnet)
library(cluster)
library(kernlab)
library(aTSA)
library(AppliedPredictiveModeling)
library(mlbench)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(partykit)
library(kableExtra)
library(factoextra)
library(FactoMineR)
library(naniar)
library(mice)
library(janitor)
library(writexl)


## ----load_data------------------------------------------------------------------------------------------------------------------------------------------------------------
#Load train data
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentData.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_beverage <- read_excel(temp_file)
#Copy data
beverage_df <- data_beverage
#Clean temp file
unlink(temp_file)

#Load test data
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentEvaluation.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_eval <- read_excel(temp_file)
#Copy data
eval_df <- data_eval
#Clean temp file
unlink(temp_file)


## ----summary_statistics---------------------------------------------------------------------------------------------------------------------------------------------------
#Check first rows of data
DT::datatable(
      beverage_df[1:10,],
      options = list(scrollX = TRUE,
                     deferRender = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     info = FALSE,  
                     paging=FALSE,
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 1: First 10 Rows of Beverage Data'
  )) 

DT::datatable(
      eval_df[1:10,],
      options = list(scrollX = TRUE,
                     deferRender = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     info = FALSE,      
                     paging=FALSE,
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 2: First 10 Rows of Evaluation Data'
  )) 

#Summary statistics
DT::datatable(
      dfSummary(beverage_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 3: Summary Statistics for Beverage Data'
  )) 
 

#Summary statistics
DT::datatable(
      dfSummary(eval_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 4: Summary Statistics for Evaluation Data'
  )) 

stat <- beverage_df %>%
  group_by(`Brand Code`) %>%
  filter(!is.na(`Brand Code`)) %>% 
  dplyr::summarize(
    Min = min(PH, na.rm = TRUE),
    Q1 = quantile(PH, 0.25, na.rm = TRUE),
    Median = median(PH, na.rm = TRUE),
    Mean = mean(PH, na.rm = TRUE),
    Q3 = quantile(PH, 0.75, na.rm = TRUE),
    Max = max(PH, na.rm = TRUE),
    StdDev = sd(PH, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(PH)) 
  )

#Summary statistics by code
DT::datatable(
      stat,
      options = list(dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE,
                     paging=FALSE,
                     info = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 5: Summary Statistics PH for Each Brand Code'
  )) %>%
  DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)


## ----ph_plots-------------------------------------------------------------------------------------------------------------------------------------------------------------
#Distribution PH plot
ggplot(beverage_df, aes(x = PH)) +
  geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of PH", x = "PH", y = "Frequency")

#Distribution Brand Code plot
ggplot(beverage_df, aes(x = `Brand Code`)) +
  geom_bar(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Count by Brand Code", x = "Brand Code", y = "Count")

filtered_df <- beverage_df %>%
  filter(!is.na(`Brand Code`))

#Boxplot brand code vs ph
ggplot(filtered_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH")


## ----group_plots----------------------------------------------------------------------------------------------------------------------------------------------------------
group_1 <- c("Carb Pressure", "Carb Pressure1", "Carb Temp", "Temperature", 
             "Usage cont", "Fill Pressure", "Air Pressurer", "Alch Rel", 
             "Balling", "Balling Lvl")

group_2 <- c("Carb Volume", "Carb Rel", "Fill Ounces", "Oxygen Filler", "Density", 
             "PC Volume", "PSC", "PSC CO2", 
             "PSC Fill", "Pressure Setpoint", "Pressure Vacuum")

group_3 <- c("Mnf Flow", "Carb Flow", "Filler Level", "Filler Speed", "Hyd Pressure1", 
             "Hyd Pressure2", "Hyd Pressure3", "Hyd Pressure4", "MFR", "Bowl Setpoint")

#Group 1 plot
beverage_df %>%
  select(all_of(group_1)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.3, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 1")

# Group 2 Plot
beverage_df %>%
  select(all_of(group_2)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 0.05, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 2")

# Group 3 Plot
beverage_df %>%
  select(all_of(group_3)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = 1, fill = "lightgreen", color = "black") +
  facet_wrap(~name, scales = "free", ncol = 5) +
  theme_minimal() +
  labs(title = "Distribution of Variables: Group 3")


## ----na_scatterplt_ph_vs_features-----------------------------------------------------------------------------------------------------------------------------------------
#NA
gg_miss_var(beverage_df, show_pct = TRUE)

#Choose numeric vars
numeric_vars <- beverage_df %>% 
  select(where(is.numeric)) %>% 
  names()

#Create scatter plots vs PH
for (var in numeric_vars) {
  p <- ggplot(beverage_df, aes_string(x = paste0("`", var, "`"), y = "PH")) +
    geom_point(alpha = 0.6, color = "blue") +
    theme_minimal() +
    labs(title = paste("Scatterplot:", var, "vs PH"), x = var, y = "PH")
  
  print(p)  # Display the plot
}



## ----corr-----------------------------------------------------------------------------------------------------------------------------------------------------------------
tst <- beverage_df %>% 
  select(where(is.numeric))
#Correlation with PH
cor_table <- cor(drop_na(tst))[, "PH"] %>%
  as.data.frame() %>%
  rownames_to_column(var = "Variable") %>%
  rename(Coefficient = ".") %>%
  arrange(desc(Coefficient))

kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  add_header_above(c("Table 6: Correlation with PH" = 2))

#Corr matrix
rcore <- rcorr(as.matrix(tst))
coeff <- rcore$r
#Filter to include only |r| > 0.1, the rest are 0
filtered_coeff <- coeff
filtered_coeff[abs(filtered_coeff) < 0.1] <- 0 
corrplot(filtered_coeff, tl.cex = .6, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.45, diag=FALSE)


## ----colnames_zeroVar-----------------------------------------------------------------------------------------------------------------------------------------------------
#Fix column names
colnames(beverage_df) <- make_clean_names(colnames(beverage_df))
#Apply the same to eval_df 
colnames(eval_df) <- make_clean_names(colnames(eval_df))
#Check new column names
colnames(beverage_df)

# Identify zero-variance predictors
zero_var <- nearZeroVar(beverage_df, saveMetrics = TRUE)
print(zero_var[zero_var$nzv, ])

beverage_df <- beverage_df[, !zero_var$nzv]
eval_df <- eval_df[, !zero_var$nzv]


## ----impute_numeric-------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(547)

#KNN imputation
beverage_imputed <- as.data.frame(beverage_df)
beverage_imputed <- knnImputation(beverage_imputed[, !names(beverage_imputed) %in% c("brand_code", "mnf_flow")])
beverage_imputed$brand_code <- beverage_df$brand_code
beverage_imputed$mnf_flow <- beverage_df$mnf_flow

#KNN imputation mnf_flow
beverage_imputed$mnf_flow[beverage_imputed$mnf_flow == -100] <- NA
beverage_imputed <-  knnImputation(beverage_imputed[, !names(beverage_imputed) %in% c("brand_code")], k = 5, scale = TRUE, meth = 'median')  
beverage_imputed$brand_code <- beverage_df$brand_code

eval_imputed <- as.data.frame(eval_df)
eval_imputed <- knnImputation(eval_imputed[, !names(eval_imputed) %in% c("brand_code", "ph")])
eval_imputed$brand_code <- eval_df$brand_code
eval_imputed$ph <- eval_df$ph
#eval_df doesn't have NAs in mnf_flow
colSums(is.na(beverage_imputed))
colSums(is.na(eval_imputed))


## ----brand_Code_impute----------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(547)
impute_rf <- function(data, target_var, exclude_vars = NULL) {
  #Ensure target_var is a factor
  data[[target_var]] <- factor(data[[target_var]])
  
  #Exclude specified variables from the model input
  if (!is.null(exclude_vars)) {
    model_data <- data[, !(names(data) %in% exclude_vars)]
  } else {
    model_data <- data
  }
  
  #Train model on non-NA data for the target variable
  rf_model <- randomForest(reformulate(".", target_var), data = model_data[!is.na(model_data[[target_var]]),], na.action = na.omit)
  
  #Predict NAs
  na_indices <- is.na(data[[target_var]])
  if (sum(na_indices) > 0) {
    predicted_values <- predict(rf_model, newdata = model_data[na_indices,])
    data[[target_var]][na_indices] <- predicted_values
  }
  
  return(data)
}

beverage_imputed <- impute_rf(beverage_imputed, "brand_code")
eval_imputed <- impute_rf(eval_imputed, "brand_code", exclude_vars = "ph")

#Verify
colSums(is.na(beverage_imputed))
colSums(is.na(eval_imputed))


## ----encode_categorical---------------------------------------------------------------------------------------------------------------------------------------------------
#One-hot Encoding for Brand Code
beverage_imputed$brand_code <- as.factor(beverage_imputed$brand_code)
eval_imputed$brand_code <- as.factor(eval_imputed$brand_code)
#Create dummies
dummy_vars <- dummyVars(~ brand_code, data = beverage_imputed, fullRank = TRUE)
beverage_imputed <- cbind(beverage_imputed, predict(dummy_vars, newdata = beverage_imputed))
beverage_imputed <- beverage_imputed %>% select(-brand_code) 

eval_imputed <- cbind(eval_imputed, predict(dummy_vars, newdata = eval_imputed))
eval_imputed <- eval_imputed %>% select(-brand_code) 


## ----new_features---------------------------------------------------------------------------------------------------------------------------------------------------------
#Creating Interaction Terms
beverage_revised <- beverage_imputed %>%
  select(-contains('_interaction'), -contains('composite'), -contains('efficiency'), -contains('index')) %>%
  mutate(
    setpoint_diff = bowl_setpoint - pressure_setpoint,  # Example of feature engineering
    carb_volume2 = carb_volume^2,  # Polynomial feature for carb_volume
    pc_volume_pressure = pc_volume * carb_pressure  # Interaction feature
  )

eval_revised <- eval_imputed %>%
  select(-contains('_interaction'), -contains('composite'), -contains('efficiency'), -contains('index')) %>%
  mutate(
    setpoint_diff = bowl_setpoint - pressure_setpoint,  # Example of feature engineering
    carb_volume2 = carb_volume^2,  # Polynomial feature for carb_volume
    pc_volume_pressure = pc_volume * carb_pressure  # Interaction feature
  )

#Remove original columns to reduce collinearity
beverage_revised <- beverage_revised %>%
  select(-c(bowl_setpoint, pressure_setpoint, carb_volume, pc_volume, carb_pressure))
eval_revised <- eval_revised %>%
  select(-c(bowl_setpoint, pressure_setpoint, carb_volume, pc_volume, carb_pressure))

# Creating a binned feature for another variable, e.g., carb_temp
beverage_revised$carb_temp_binned <- cut(beverage_revised$carb_temp,
                                         breaks=quantile(beverage_revised$carb_temp, probs=seq(0, 1, 0.25)),
                                         include.lowest=TRUE,
                                         labels=FALSE)

eval_revised$carb_temp_binned <- cut(eval_revised$carb_temp,
                                         breaks=quantile(eval_revised$carb_temp, probs=seq(0, 1, 0.25)),
                                         include.lowest=TRUE,
                                         labels=FALSE)

# Optionally, remove original carb_temp if the binned version is preferred
beverage_revised <- beverage_revised %>%
  select(-carb_temp)

eval_revised <- eval_revised %>%
  select(-carb_temp)


## ----transform_vars-------------------------------------------------------------------------------------------------------------------------------------------------------
# Apply transformations
set.seed(123)
# Log transformations
beverage_transformed <- beverage_imputed
beverage_transformed$filler_speed <- log(beverage_transformed$filler_speed)
beverage_transformed$mfr <- log(beverage_transformed$mfr)

# Square root transformations
beverage_transformed$temperature <- sqrt(beverage_transformed$temperature)
beverage_transformed$air_pressurer <- sqrt(beverage_transformed$air_pressurer)

# Shifted log transformations
beverage_transformed$oxygen_filler <- log(beverage_transformed$oxygen_filler + 1)
beverage_transformed$psc_co2 <- log(beverage_transformed$psc_co2 + 1)

# Log transformations
eval_transformed <- eval_imputed
eval_transformed$filler_speed <- log(eval_imputed$filler_speed)
eval_transformed$mfr <- log(eval_imputed$mfr)

# Square root transformations
eval_transformed$temperature <- sqrt(eval_transformed$temperature)
eval_transformed$air_pressurer <- sqrt(eval_transformed$air_pressurer)

# Shifted log transformations
eval_transformed$oxygen_filler <- log(eval_transformed$oxygen_filler + 1)
eval_transformed$psc_co2 <- log(eval_transformed$psc_co2 + 1)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Summary statistics
DT::datatable(
      dfSummary(beverage_revised, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 7: Summary Statistics for Revised (Feature-Engineered) Beverage Data'
  )) 
 

#Summary statistics
DT::datatable(
      dfSummary(eval_revised, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 8: Summary Statistics for Revised (Feature-Engineered) Evaluation Data'
  )) 


## ----parallel_core--------------------------------------------------------------------------------------------------------------------------------------------------------
#Detect the number of available cores
numCores <- detectCores()

#Register parallel backend (leave one core free for system tasks; max() guards
#against single-core machines)
cl <- makeCluster(max(1, numCores - 1))
registerDoParallel(cl)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Data with new features
set.seed(123)
trainIndex <- createDataPartition(beverage_revised$ph, p = 0.8, list = FALSE)
train_revised <- beverage_revised[trainIndex, ]
test_revised <- beverage_revised[-trainIndex, ]
eval_revised <- subset(eval_revised, select = -ph)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
trainIndex <- createDataPartition(beverage_transformed$ph, p = 0.8, list = FALSE)
train_transformed <- beverage_transformed[trainIndex, ]
test_transformed <- beverage_transformed[-trainIndex, ]
eval_transformed <- subset(eval_transformed, select = -ph)
#Setup cv
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, allowParallel = TRUE)
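
#Note: repeatedcv with number = 10 and repeats = 3 means each candidate in a
#tuning grid is fit 30 times (10 folds x 3 repeats); the resampling metrics
#reported by train() are averages over those 30 fits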


## ----mars_model-----------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)

#Train MARS
marsModel <- train(ph ~ ., data = train_transformed, 
                   method = "earth",
                   preProc = c("center", "scale"),
                   tuneGrid = marsGrid,
                   trControl = control
)


#degree nprune      RMSE   Rsquared       MAE   
# 2     20      0.1269369   0.459168    0.0960094 
mars_tune <- marsModel$results[marsModel$results$nprune == marsModel$bestTune$nprune & marsModel$results$degree == marsModel$bestTune$degree, ]
mars_tune

#Predict
marsPred <- predict(marsModel, newdata = test_transformed)
#RMSE      Rsquared        MAE 
#0.12208106 0.51031566 0.09219265
mars_results <- postResample(pred = marsPred, obs = test_transformed$ph)
mars_results
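
#Optional (sketch): summary() of the underlying earth fit lists the selected
#hinge terms, showing which predictors enter the final MARS basis
summary(marsModel$finalModel)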


## ----mars_varimpt---------------------------------------------------------------------------------------------------------------------------------------------------------
importance_mars <- varImp(marsModel, scale = FALSE)
importance_mars
plot(importance_mars, top = 10, main = "Top 10 predictors, MARS Model")


## ----svm_model------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
svmGrid <- expand.grid(sigma = c(0.001, 0.01, 0.1), C = c(0.1, 1, 10, 100))
#Train SVM
svmModel <- train(ph ~ ., data = train_transformed, 
                  method = "svmRadial", 
                  preProc = c("center", "scale"),
                  tuneGrid = svmGrid,
                  trControl = control
                  )

#sigma  C      RMSE   Rsquared        MAE   
# 0.01 10  0.1193689  0.5235771   0.08771225
svm_tune <- svmModel$results[svmModel$results$sigma == svmModel$bestTune$sigma & svmModel$results$C == svmModel$bestTune$C, ]
svm_tune

#Predict
svmPred <- predict(svmModel, newdata = test_transformed)
#RMSE   Rsquared        MAE 
#0.11787203 0.54495184 0.08445738 
svm_results <- postResample(pred = svmPred, obs = test_transformed$ph)
svm_results


## ----svm_varimpt----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_svm <- varImp(svmModel, scale = FALSE)
importance_svm
plot(importance_svm, top = 10, main = "Top 10 predictors, SVM Model")


## ----nnet_model-----------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)

#Train avNNet
nnetModel <- train(ph ~ ., data = train_transformed, 
                   method = "avNNet",
                   preProc = c("center", "scale"),
                   tuneGrid = nnetGrid,
                   trControl = control,
                   linout = TRUE,
                   trace = FALSE)  

#decay size   bag      RMSE    Rsquared       MAE
#  0   10   FALSE   0.1145209   0.5585798 0.0855949 
nnet_tune <- nnetModel$results[nnetModel$results$size == nnetModel$bestTune$size & nnetModel$results$decay == nnetModel$bestTune$decay, ]
nnet_tune

#Predict
nnetPred <- predict(nnetModel, newdata = test_transformed)
#RMSE        Rsquared        MAE 
#0.11073852 0.59788239 0.08266435
nnet_results <- postResample(pred = nnetPred, obs = test_transformed$ph)
nnet_results


## ----nnet_varimpt---------------------------------------------------------------------------------------------------------------------------------------------------------
importance_nnet <- varImp(nnetModel, scale = FALSE)
importance_nnet
plot(importance_nnet, top = 10, main = "Top 10 predictors, avNNet Model")


## ----gbm_model------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
gbmGrid <- expand.grid(.n.trees = seq(100, 1000, by = 100), .interaction.depth = seq(1, 7, by = 2), .shrinkage = 0.01, .n.minobsinnode = c(5, 10, 15))
#Train GBM
gbm_model <- train(ph ~ ., data = train_revised, 
                   method = "gbm", 
                   preProc = c("center", "scale"),
                   tuneGrid = gbmGrid, 
                   trControl = control,
                   verbose = FALSE)

#shrinkage interaction.depth n.minobsinnode n.trees      RMSE     Rsquared        MAE   
# 0.01                 7             10    1000       0.1101819    0.5955535 0.08304377
gbm_tune <- gbm_model$results[gbm_model$results$n.trees == gbm_model$bestTune$n.trees & gbm_model$results$interaction.depth == gbm_model$bestTune$interaction.depth & gbm_model$results$n.minobsinnode == gbm_model$bestTune$n.minobsinnode,]
gbm_tune

#Predict
gbm_predictions <- predict(gbm_model, newdata = test_revised)
#      RMSE   Rsquared        MAE 
#0.10548388 0.64448254 0.07928038 
gbm_results <- postResample(pred = gbm_predictions, obs = test_revised$ph)
gbm_results


## ----gbm_varimpt----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_gbm <- varImp(gbm_model, scale = FALSE)
importance_gbm
plot(importance_gbm, top = 10, main = "Top 10 predictors, GBM")


## ----rf_model-------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
rfGrid <- expand.grid(.mtry = c(2, 4, 6, 8, 10))

#Train rf
rf_model <- train(ph ~ ., data = train_revised, 
                  method = "rf",
                  preProc = c("center", "scale"),
                  trControl = control,
                  tuneGrid = rfGrid,
                  importance = TRUE) 

#mtry      RMSE   Rsquared        MAE   
# 10  0.1007928     0.6741332   0.07271081 
rf_tune <- rf_model$results[rf_model$results$mtry == rf_model$bestTune$mtry, ]
rf_tune

#Predict
rf_pred <- predict(rf_model, newdata = test_revised)
#RMSE      Rsquared        MAE 
#0.09630017 0.71303966 0.06989033 
rf_results <- postResample(pred = rf_pred, obs = test_revised$ph)
rf_results
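
#Optional cross-check (sketch): the underlying randomForest object also
#reports an out-of-bag error estimate alongside the CV results above
print(rf_model$finalModel)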


## ----rf_varimpt-----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_rf <- varImp(rf_model, scale = FALSE)
importance_rf
plot(importance_rf, top = 10, main = "Top 10 predictors, Random Forest")


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Stop the parallel cluster once you're done with the models and return to
#sequential processing so later foreach-based calls don't error
stopCluster(cl)
registerDoSEQ()


## ----compare_models-------------------------------------------------------------------------------------------------------------------------------------------------------
#Create empty df
results <- data.frame(
  Model = character(),
  Resample_RMSE = numeric(),
  Resample_R2 = numeric(),
  Resample_MAE = numeric(),
  Test_RMSE = numeric(),
  Test_R2 = numeric(),
  Test_MAE = numeric(),
  stringsAsFactors = FALSE
)

#Fill df with results
results <- rbind(results, data.frame(Model = "MARS", Resample_RMSE = mars_tune$RMSE, Resample_R2 = mars_tune$Rsquared, Resample_MAE = mars_tune$MAE, Test_RMSE = mars_results[1], Test_R2 = mars_results[2], Test_MAE = mars_results[3]))
results <- rbind(results, data.frame(Model = "SVM", Resample_RMSE = svm_tune$RMSE, Resample_R2 = svm_tune$Rsquared, Resample_MAE = svm_tune$MAE, Test_RMSE = svm_results[1], Test_R2 = svm_results[2], Test_MAE = svm_results[3]))
results <- rbind(results, data.frame(Model = "NNet", Resample_RMSE = nnet_tune$RMSE, Resample_R2 = nnet_tune$Rsquared, Resample_MAE = nnet_tune$MAE, Test_RMSE = nnet_results[1], Test_R2 = nnet_results[2], Test_MAE = nnet_results[3]))
results <- rbind(results, data.frame(Model = "GBM", Resample_RMSE = gbm_tune$RMSE, Resample_R2 = gbm_tune$Rsquared, Resample_MAE = gbm_tune$MAE, Test_RMSE = gbm_results[1], Test_R2 = gbm_results[2], Test_MAE = gbm_results[3]))
results <- rbind(results, data.frame(Model = "RF", Resample_RMSE = rf_tune$RMSE, Resample_R2 = rf_tune$Rsquared, Resample_MAE = rf_tune$MAE, Test_RMSE = rf_results[1], Test_R2 = rf_results[2], Test_MAE = rf_results[3]))
row.names(results) <- NULL

kable(results, "html", escape = FALSE, align = c('l', rep('c', 6)), col.names = c("Model", "Resample RMSE", "Resample R²", "Resample MAE", "Test RMSE", "Test R²", "Test MAE")) %>%
  kable_styling("striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  add_header_above(c(" " = 1, "Resample Metrics" = 3, "Test Metrics" = 3), 
                   bold = TRUE, background = "#D3D3D3", color = "black")
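
#Optional comparison (sketch): caret's resamples() collects the per-fold CV
#results so the models can be compared on their full resampling distributions
#rather than single point estimates; this assumes all five models share the
#same folds, which holds here because the same seed is set before each train()
#call on data with identical row counts
model_list <- list(MARS = marsModel, SVM = svmModel, NNet = nnetModel,
                   GBM = gbm_model, RF = rf_model)
resamps <- resamples(model_list)
summary(resamps)
bwplot(resamps, metric = "RMSE")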


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Predictions for the evaluation dataset
eval_pred <- predict(rf_model, newdata = eval_revised)

#Create df
eval_results <- data.frame(
  ph_pred = eval_pred
)

eval_revised_pred <- eval_revised
eval_revised_pred$ph_pred <- eval_pred

#Save to Excel
write_xlsx(eval_revised_pred, "eval_revised_pred.xlsx")
write_xlsx(eval_results, "ph_predictions.xlsx")
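
#Quick sanity check (sketch): predicted pH should sit within the observed
#training range; values outside it would warrant review
summary(eval_pred)
range(train_revised$ph)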


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Average predicted pH by brand code (the evaluation data has no observed ph;
#the predictions live in ph_pred)
eval_revised_pred %>%
  group_by(brand_code) %>%
  summarise(`Average Predicted PH` = mean(ph_pred))

stat <- eval_revised_pred %>%
  group_by(brand_code) %>%
  filter(!is.na(brand_code)) %>%
  dplyr::summarize(
    Min = min(ph_pred, na.rm = TRUE),
    Q1 = quantile(ph_pred, 0.25, na.rm = TRUE),
    Median = median(ph_pred, na.rm = TRUE),
    Mean = mean(ph_pred, na.rm = TRUE),
    Q3 = quantile(ph_pred, 0.75, na.rm = TRUE),
    Max = max(ph_pred, na.rm = TRUE),
    StdDev = sd(ph_pred, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(ph_pred))
  )

#Summary statistics by code
DT::datatable(
      stat,
      options = list(dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE,
                     paging=FALSE,
                     info = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 9: Summary Statistics of Predicted PH for Each Brand Code (Evaluation Data)'
  )) %>%
  DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)


stat <- beverage_df %>%
  group_by(`Brand Code`) %>%
  filter(!is.na(`Brand Code`)) %>% 
  dplyr::summarize(
    Min = min(PH, na.rm = TRUE),
    Q1 = quantile(PH, 0.25, na.rm = TRUE),
    Median = median(PH, na.rm = TRUE),
    Mean = mean(PH, na.rm = TRUE),
    Q3 = quantile(PH, 0.75, na.rm = TRUE),
    Max = max(PH, na.rm = TRUE),
    StdDev = sd(PH, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(PH)) 
  )

#Summary statistics by code
DT::datatable(
      stat,
      options = list(dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE,
                     paging=FALSE,
                     info = FALSE), 
      rownames = FALSE,
      caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
    'Table 10: Summary Statistics of PH for Each Brand Code (Original Beverage Data)'
  )) %>%
  DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Extract all R code from the source Rmd into a standalone script
library(knitr)
purl("daria_draft.Rmd", output = "extracted_code.R")