The purpose of this analysis is to understand the beverage manufacturing process at ABC Beverage. We do this through a broad evaluation of pH, consisting of exploratory data analysis followed by careful model selection for predicting pH levels in the manufacturing process.
We began this analysis by looking into which variables would present issues for the model, and found a glaring problem with the state of the original beverage data: the Brand Code and MFR variables were missing roughly 4.7% and 8.2% of their values, respectively. Following the discovery of this missingness, the group moved to evaluate the data further through thorough exploratory data analysis (EDA). The EDA confirmed that the dataset contained missing values, skewness, and outliers, with key variables like Brand Code and MFR requiring special attention in the steps prior to modeling.
After cleaning the data, we shifted our attention to building models and evaluating performance. The group settled on five handpicked models: Random Forest, Gradient Boosting, Neural Networks, Support Vector Machine, and MARS. We chose these models to cover a range of techniques instead of focusing on one area of predictive modeling. Lastly, we selected the single model that outperformed all of the others.
Due to its high performance relative to the other four models, the group chose random forest as the best fit. While evaluating the top 10 predictors in our random forest model, the mnf_flow variable stood out the most: carrying nearly 45% of the importance in the model, mnf_flow is undoubtedly the most important variable. In consultation with the correlation matrix in the EDA section, we can deduce that mnf_flow is not only important but also negatively correlated with pH levels in beverage production. The random forest model behind this discovery boasts a test R-squared of 71.3%, a test RMSE of 0.096, and a test MAE of 0.069, which highlights the model's accuracy and reliability in predicting pH levels.
pH is a critical quality metric in beverage production because it directly affects the flavor, safety, shelf life, and consistency of the product. Maintaining the correct pH level ensures that the beverage is resistant to microbial growth and chemical deterioration, which is essential for consumer safety and product stability. Regulatory requirements mandate that beverage manufacturers closely monitor and control pH levels to meet industry standards and ensure product quality. Non-compliance with these regulations can lead to product recalls, legal issues, and damage to the company's reputation. The FDA publishes resources and best practices on these requirements.
Data for training and evaluation are obtained from the GitHub repository via URLs, ensuring that the most recent version is used for analysis. Temporary files serve as intermediate storage to streamline data collection.
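A minimal sketch of that loading pattern (mirroring the appendix code; the URL shown is the repository's raw training file):

```r
library(curl)    # curl_download()
library(readxl)  # read_excel()

# Download the raw workbook to a temporary file, read it, then clean up
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentData.xlsx"
temp_file <- tempfile(fileext = ".xlsx")
curl_download(url, temp_file)
beverage_df <- read_excel(temp_file)
unlink(temp_file)
```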
Beverage dataset: 2571 rows x 33 columns.
Evaluation dataset: 267 rows x 33 columns.
Mostly numeric, with one categorical variable (Brand Code).
NAs:
- Brand Code: ~4.7% missing in beverage_df, ~3.0% missing in eval_df.
- Many numeric features have a small percentage of missing values, generally < 2%.
- PH (the target variable in beverage_df): 4 NAs, all for Brand Code B. In eval_df, all PH values are missing (it is the column to be predicted).
Impute missing values for numeric features using median or mean (or more
advanced imputation if correlated features exist). Missing values should
be handled within a robust pipeline to avoid information loss.
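As a sketch of such a pipeline (using caret; `train_df` and `eval_df` are stand-in names here, and `medianImpute` could be swapped for `knnImpute` or `bagImpute`):

```r
library(caret)

# Learn imputation parameters on the training data only, then apply
# the identical transformation to both datasets to avoid leakage
pre <- preProcess(train_df, method = "medianImpute")
train_imputed <- predict(pre, train_df)
eval_imputed  <- predict(pre, eval_df)
```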
Skewness and outliers
Variables like Mnf Flow have extreme values (mean is heavily influenced
by outliers). Range: -100.2 to 229.4.
PH: Check for extreme pH values that may affect model accuracy.
Hyd Pressure1–4: High standard deviations (e.g., Hyd Pressure2 with a
mean of ~21 and SD of 16.4).
Analyze the distribution of these variables using histograms or boxplots. Consider winsorization or Box-Cox/log transformations for skewed distributions.
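A quick sketch of that check (assuming the original column names with spaces; BoxCoxTrans() requires strictly positive values, so it is shown on MFR rather than Mnf Flow):

```r
library(e1071)  # skewness()
library(caret)  # BoxCoxTrans()

# Visualize and quantify skew
hist(beverage_df$`Mnf Flow`, main = "Mnf Flow", xlab = "Mnf Flow")
skewness(beverage_df$`Mnf Flow`, na.rm = TRUE)

# Estimate and apply a Box-Cox transformation for a positive variable
bc <- BoxCoxTrans(beverage_df$MFR, na.rm = TRUE)
mfr_transformed <- predict(bc, beverage_df$MFR)
```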
Feature importance for PH prediction
Carb Volume, Fill Ounces, PC Volume, Carb Pressure, and Carb Temp have small standard deviations and are likely controlled manufacturing parameters. These might directly influence pH.
Brand Code can be treated as a categorical predictor for brand-specific
variations.
Use correlation or feature importance to identify which variables most strongly influence PH.
Brand Code: 4 levels (A, B, C, D).
Unbalanced distribution: B accounts for ~50% of records, while A, C, and
D are much smaller. The imbalance might affect models like decision
trees or ensemble methods.
Apply stratified sampling or weighting to handle imbalance during
training. Explore interaction effects between Brand Code and numeric
variables.
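One quick, hypothetical way to probe such an interaction before committing to a model (names as in the raw data):

```r
# Does the Mnf Flow vs. PH slope differ by brand?
fit <- lm(PH ~ `Brand Code` * `Mnf Flow`, data = beverage_df)
summary(fit)  # significant interaction terms suggest brand-specific slopes
```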
Multicollinearity
Variables such as Carb Volume, PC Volume, and Carb Pressure might be
correlated due to their role in carbonation.
Multiple pressure-related variables (Carb Pressure1, Fill Pressure,
etc.) and filler speed/level metrics could also have collinear
relationships.
Compute a correlation matrix to detect highly correlated
predictors.
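A sketch of that check with caret's findCorrelation() (the 0.9 cutoff is illustrative):

```r
library(caret)

# Pairwise correlations over the numeric columns
num_df  <- beverage_df[sapply(beverage_df, is.numeric)]
cor_mat <- cor(num_df, use = "pairwise.complete.obs")

# Flag predictors whose pairwise |r| exceeds the cutoff
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)
```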
Data Leakage
Variables like PSC, PSC Fill, and PSC CO2 could be downstream measures dependent on pH. Confirm whether these are part of the production process or outcome metrics.
Analyze the production process to ensure no data leakage into the
model.
eval_df
All PH values are missing for prediction.
Structure and missingness are similar to beverage_df. Ensure
preprocessing and feature engineering pipelines are consistent between
training and evaluation datasets.
- Carb Volume: Volume of the beverage that is carbonated.
- Fill Ounces: Volume of the beverage filled into containers, measured in ounces.
- PC Volume: Could indicate "Process Control Volume," a measure related to production parameters.
- Carb Pressure: Pressure used during the carbonation process.
- Carb Temp: Temperature of the beverage during carbonation.
- PSC: Potentially a quality control metric; the exact meaning would depend on industry context.
- PSC Fill: Quality or quantity metric related to the filling process.
- PSC CO2: Measures CO2 levels or pressure during processing.
- Mnf Flow: Manufacturing flow rate, likely related to how fast the production line operates.
- Carb Pressure1: Another measurement of pressure during carbonation, possibly at a different production stage.
- Fill Pressure: Pressure maintained during the filling of beverages into containers.
- Hyd Pressure1, Hyd Pressure2, Hyd Pressure3, Hyd Pressure4: Different points of hydraulic pressure measurement within the machinery.
- Filler Level: Level to which containers are filled; critical for consistency.
- Filler Speed: Speed at which the filling machinery operates.
- Temperature: General temperature measurement.
- Carb Flow: Flow rate of the carbonation process.
- Density: Physical density of the beverage, which impacts taste and mouthfeel.
- MFR: Manufacturer's reference; could relate to specific production settings or machine identifiers.
- Balling: Measurement related to the sugar content of a liquid, typically used in brewing.
- Pressure Vacuum: Pressure readings from a vacuum process, possibly in packaging.
- Oxygen Filler: Likely measures the presence of oxygen during filling, which must be minimized in certain beverages.
- Bowl Setpoint: Setpoint for a particular part of the machinery, possibly related to mixing or blending.
- Pressure Setpoint: Target pressure to be maintained within certain machinery or process stages.
- Air Pressurer: Air pressure readings.
- Alch Rel: Could be "Alcohol Release" or related to alcohol content management.
- Carb Rel: Possibly "Carbonation Reliability" or a similar measure of carbonation consistency.
- Balling Lvl: Specific gravity of the liquid related to its sugar content, important in quality control in brewing and beverage manufacturing.
The pH of a solution is measured on a scale from 0 to 14, indicating its acidity or alkalinity. In beverage manufacturing, maintaining the correct pH level is crucial for ensuring flavor, safety, shelf life, and uniformity.
The data shows that the most common pH level is around 8.5, which appears to be the target for many batches of beverages. For carbonated beverages, a slightly alkaline pH level can:
- Improve Taste: Balance the tang of carbonation with a smoother, less acidic flavor.
- Aid in Shelf Stability: Ensure the beverage is resistant to microbial growth and chemical deterioration.
- Reflect Process Consistency: Indicate consistent raw material quality, carbonation levels, and filling precision.
The distribution of pH is roughly normal with a slight negative skew, centered around 8.5, and includes some outliers at both the lower and upper tails.
Brand B has significantly more entries compared to other brands, suggesting the need for balancing or stratified sampling during model training. The pH distribution varies across Brand Codes, with Brand C having the lowest median pH and Brand B the highest. Outliers are present in all brand codes, and it is important to investigate whether these outliers are due to measurement errors or valid extreme cases.
Several variables in the dataset exhibit notable characteristics and potential issues:
For variables like MFR and Usage Cont, applying log or Box-Cox transformations can help normalize their distributions.
Mnf Flow: A value of -100 appears frequently, which could indicate:
- Missing Data: A common placeholder value.
- Domain-Specific Encoding: For example, representing "no flow" or a specific state in manufacturing.
- Data Error: A mistake in data collection or recording.
It is essential to confirm whether -100 is a legitimate value or a placeholder. If it is a placeholder, it should be treated as missing and imputed or removed to ensure data integrity.
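If -100 turns out to be a placeholder, the recoding is one line (this mirrors the imputation step later in the workflow and uses the cleaned column name mnf_flow applied there):

```r
# Treat the -100 sentinel as missing so the imputation step can fill it
beverage_df$mnf_flow[beverage_df$mnf_flow == -100] <- NA
```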
The dataset contains varying levels of missing values across different variables:
The data revealed several important correlations and potential multicollinearity issues. We used the correlation matrix to begin to understand the inner workings of the production process. Below is a write-up of these initial findings.
Key Variable Correlations:
- Bowl Setpoint (correlation: 0.36): This parameter likely affects the flow or intensity of ingredient mixing during carbonation or flavoring stages. A higher bowl setpoint may result in more uniform mixing, reducing pH variability.
- Filler Level (correlation: 0.35): Ensures containers are properly filled to prevent extra air or gas imbalances. Variations in filler level can influence carbonation levels and pH, so proper calibration is essential.
- Pressure Vacuum (correlation: 0.22): Assists with gas exchange during filling. Variations in vacuum pressure can cause inconsistent carbonation, affecting pH. Monitoring this parameter ensures the desired carbonation "bite."
- Oxygen Filler (correlation: 0.17): Determines the amount of oxygen introduced during filling. Higher oxygen levels can accelerate oxidation, reducing pH and compromising flavor stability. Maintaining low oxygen levels is crucial for product quality.
- Mnf Flow (correlation: -0.45): Describes the rate at which materials flow during production. A higher flow rate may result in irregular mixing, reduced carbonation, and a shift in pH. Adjusting flow rates helps maintain consistent product attributes.
Table 6: Correlation with PH

| Variable | Coefficient |
|---|---|
| PH | 1.0000000 |
| Bowl Setpoint | 0.3615875 |
| Filler Level | 0.3520440 |
| Carb Flow | 0.2335937 |
| Pressure Vacuum | 0.2197355 |
| Carb Rel | 0.1960515 |
| Alch Rel | 0.1666822 |
| Oxygen Filler | 0.1644854 |
| Balling Lvl | 0.1093712 |
| PC Volume | 0.0988667 |
| Density | 0.0955469 |
| Balling | 0.0767002 |
| Carb Pressure | 0.0762134 |
| Carb Volume | 0.0721325 |
| Carb Temp | 0.0322794 |
| Air Pressurer | -0.0079972 |
| PSC Fill | -0.0238098 |
| Filler Speed | -0.0408830 |
| MFR | -0.0451965 |
| Hyd Pressure1 | -0.0470664 |
| PSC | -0.0698730 |
| PSC CO2 | -0.0852599 |
| Fill Ounces | -0.1183359 |
| Carb Pressure1 | -0.1187642 |
| Hyd Pressure4 | -0.1714340 |
| Temperature | -0.1826596 |
| Hyd Pressure2 | -0.2226600 |
| Hyd Pressure3 | -0.2681018 |
| Pressure Setpoint | -0.3116639 |
| Fill Pressure | -0.3165145 |
| Usage cont | -0.3576120 |
| Mnf Flow | -0.4592313 |
Column names were standardized with make_clean_names() to eliminate issues caused by case sensitivity or spaces, making data manipulation easier and less error-prone.
## [1] "brand_code" "carb_volume" "fill_ounces"
## [4] "pc_volume" "carb_pressure" "carb_temp"
## [7] "psc" "psc_fill" "psc_co2"
## [10] "mnf_flow" "carb_pressure1" "fill_pressure"
## [13] "hyd_pressure1" "hyd_pressure2" "hyd_pressure3"
## [16] "hyd_pressure4" "filler_level" "filler_speed"
## [19] "temperature" "usage_cont" "carb_flow"
## [22] "density" "mfr" "balling"
## [25] "pressure_vacuum" "ph" "oxygen_filler"
## [28] "bowl_setpoint" "pressure_setpoint" "air_pressurer"
## [31] "alch_rel" "carb_rel" "balling_lvl"
nearZeroVar() flagged hyd_pressure1 as a near-zero-variance predictor, and it was dropped from both datasets:
## freqRatio percentUnique zeroVar nzv
## hyd_pressure1 31.11111 9.529366 FALSE TRUE
To impute missing values, knnImputation() was used for the numeric variables, with the -100 placeholder in mnf_flow recoded as missing and imputed the same way; a random forest model was used to impute brand_code. The counts below show the remaining NAs after KNN imputation (brand_code still missing in both datasets, plus all 267 ph values in eval_df) and then after random forest imputation of brand_code.
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## carb_pressure1 fill_pressure hyd_pressure2 hyd_pressure3
## 0 0 0 0
## hyd_pressure4 filler_level filler_speed temperature
## 0 0 0 0
## usage_cont carb_flow density mfr
## 0 0 0 0
## balling pressure_vacuum ph oxygen_filler
## 0 0 0 0
## bowl_setpoint pressure_setpoint air_pressurer alch_rel
## 0 0 0 0
## carb_rel balling_lvl mnf_flow brand_code
## 0 0 0 120
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## mnf_flow carb_pressure1 fill_pressure hyd_pressure2
## 0 0 0 0
## hyd_pressure3 hyd_pressure4 filler_level filler_speed
## 0 0 0 0
## temperature usage_cont carb_flow density
## 0 0 0 0
## mfr balling pressure_vacuum oxygen_filler
## 0 0 0 0
## bowl_setpoint pressure_setpoint air_pressurer alch_rel
## 0 0 0 0
## carb_rel balling_lvl brand_code ph
## 0 0 8 267
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## carb_pressure1 fill_pressure hyd_pressure2 hyd_pressure3
## 0 0 0 0
## hyd_pressure4 filler_level filler_speed temperature
## 0 0 0 0
## usage_cont carb_flow density mfr
## 0 0 0 0
## balling pressure_vacuum ph oxygen_filler
## 0 0 0 0
## bowl_setpoint pressure_setpoint air_pressurer alch_rel
## 0 0 0 0
## carb_rel balling_lvl mnf_flow brand_code
## 0 0 0 0
## carb_volume fill_ounces pc_volume carb_pressure
## 0 0 0 0
## carb_temp psc psc_fill psc_co2
## 0 0 0 0
## mnf_flow carb_pressure1 fill_pressure hyd_pressure2
## 0 0 0 0
## hyd_pressure3 hyd_pressure4 filler_level filler_speed
## 0 0 0 0
## temperature usage_cont carb_flow density
## 0 0 0 0
## mfr balling pressure_vacuum oxygen_filler
## 0 0 0 0
## bowl_setpoint pressure_setpoint air_pressurer alch_rel
## 0 0 0 0
## carb_rel balling_lvl brand_code ph
## 0 0 0 267
One-hot encoding, interaction terms, and transformations
The team created six new transformations of six original variables in the dataset to try to improve performance. The changes were applied to both the original dataset and the eval dataset.
Transformed variables and the transformations applied:
* filler_speed (log)
* mfr (log)
* temperature (sqrt)
* air_pressurer (sqrt)
* oxygen_filler (shifted log)
* psc_co2 (shifted log)
Tables 7 and 8 were created to visualize this change in both transformed data frames.
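For the one-hot encoding part of this step, the workflow uses caret's dummyVars() (mirroring the appendix; fullRank = TRUE drops a reference level to avoid perfect collinearity):

```r
library(caret)

# Learn the dummy coding on the training data, apply it to both datasets
dummy_vars <- dummyVars(~ brand_code, data = beverage_imputed, fullRank = TRUE)
beverage_imputed <- cbind(beverage_imputed,
                          predict(dummy_vars, newdata = beverage_imputed))
beverage_imputed$brand_code <- NULL  # keep only the dummy columns
```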
Splitting data into training and testing sets (via createDataPartition()). The goal is to test the model on previously unseen data in order to estimate how well it would perform in real-world circumstances. A stratified sampling strategy ensures that the distribution of essential variables (such as pH levels) remains consistent between the training and testing sets.
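The split itself (as in the appendix; p = 0.8 reserves 20% for testing, and stratifying on ph keeps its distribution comparable across the two sets):

```r
library(caret)

set.seed(123)
# Stratified 80/20 split on the target variable
trainIndex <- createDataPartition(beverage_revised$ph, p = 0.8, list = FALSE)
train_revised <- beverage_revised[trainIndex, ]
test_revised  <- beverage_revised[-trainIndex, ]
```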
The use of numerous models (Random Forest, Gradient Boosting, Neural Networks, Support Vector Machine, and MARS) provides a comprehensive view of how various techniques perform with the provided data.
- MARS: A flexible non-parametric technique capable of effectively modeling nonlinearities and interactions.
- Support Vector Machine: Effective for determining the ideal boundary between outputs; best suited for boundary decisions rather than probability estimations.
- Neural Networks: Ideal for capturing complicated patterns in large datasets, but may demand significant computational resources.
- Gradient Boosting: Effective for managing a variety of data types, gradually improving performance through boosting.
- Random Forest: Effective for handling non-linear relationships and resistant to overfitting.
Cross-validation (specified in trainControl()) ensures that the model is rigorously validated and not tailored to a single subset of the data. Repeated cross-validation helps to reduce variance in the model evaluation process, resulting in a more accurate performance estimate. Using grids (expand.grid()) for distinct model parameters enables systematic searching through a set of alternatives to discover the ideal combination that maximizes model performance.
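The resampling setup and a sample grid (mirroring the appendix code):

```r
library(caret)

# 10-fold cross-validation repeated 3 times, run in parallel when available
control <- trainControl(method = "repeatedcv", number = 10,
                        repeats = 3, allowParallel = TRUE)

# Example tuning grid for MARS: interaction degree and pruned term count
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
```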
MARS
MARS is a versatile modeling method that successfully captures nonlinearities and variable interactions. Preparation entails centering and scaling the data to standardize feature scales, so the model can focus on discovering significant interactions and trends without being influenced by differing variable scales.
The model was trained using the earth method with a parameter grid specifying degrees and the number of pruning terms (nprune). Centering and scaling were applied to normalize the variables. MARS builds models by generating piecewise linear splines, which are especially useful for dealing with complex, non-linear interactions that may be present in data from various manufacturing processes. To avoid overfitting, the model starts with the maximum number of basis functions and prunes them using the generalized cross-validation criterion.
The best model, with a degree of 2 and 20 pruning terms, showed a cross-validated RMSE of 0.1269, R-squared of 0.459, and MAE of 0.0960, indicating moderate predictive accuracy. mnf_flow was the most influential predictor, followed by variables like brand_code.C and hyd_pressure3, suggesting these factors are critical in predicting pH levels.
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 38 2 20 0.1269369 0.459168 0.0960094 0.01134229 0.07564645 0.00693096
## RMSE Rsquared MAE
## 0.12208106 0.51031566 0.09219265
Variable importance
## earth variable importance
##
## Overall
## mnf_flow 100.00
## brand_code.C 72.84
## hyd_pressure3 70.05
## alch_rel 64.81
## bowl_setpoint 60.48
## pressure_vacuum 57.10
## air_pressurer 49.35
## balling 42.60
## density 39.93
## temperature 30.51
## usage_cont 22.44
## carb_pressure1 18.34
Support Vector Machine
SVM data preparation involves scaling and centering, which are critical for this model because it constructs a hyperplane in high-dimensional space (for regression, as here, support vector regression fits a function within a margin of tolerance around that hyperplane). The scale of the data influences the hyperplane's location, and unscaled data might mislead the learning algorithm by emphasizing larger-scale features disproportionately.
SVM is constructed by determining the best hyperplane that divides the classes with the greatest margin. It is resistant to outliers and effective in high-dimensional spaces. Kernel functions, such as the radial basis function utilized here, enable SVM to learn non-linear bounds, making it suitable for a variety of data patterns.
Achieved a cross-validated RMSE of 0.1194, R-squared of 0.5236, and MAE of 0.0877, showing slightly better performance than MARS.
Illustrated less reliance on any single variable, with mnf_flow still significant but less dominant, showcasing SVM’s distribution of importance across multiple features.
## sigma C RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 7 0.01 10 0.1193689 0.5235771 0.08771225 0.008812062 0.05431116 0.004872963
## RMSE Rsquared MAE
## 0.11787203 0.54495184 0.08445738
Variable importance
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## mnf_flow 0.229134
## usage_cont 0.149547
## bowl_setpoint 0.114516
## filler_level 0.101887
## pressure_setpoint 0.097533
## carb_flow 0.086556
## brand_code.C 0.078800
## hyd_pressure3 0.053540
## pressure_vacuum 0.048704
## fill_pressure 0.043607
## hyd_pressure2 0.034975
## brand_code.D 0.034145
## oxygen_filler 0.032301
## carb_rel 0.024706
## hyd_pressure4 0.024241
## temperature 0.024065
## alch_rel 0.019814
## brand_code.B 0.012828
## psc 0.009141
## fill_ounces 0.008995
Neural networks
Data standardization is required for neural networks to learn efficiently. Scaling and centering reduce the dominance of higher magnitude features and improve gradient descent optimization by guaranteeing more uniform scaling across all inputs.
Neural networks are made up of layers of interconnected nodes, with each connection representing a weight; the nodes apply activation functions to their input data. Training entails adjusting these weights based on the difference between expected and actual results. The model utilized here, avNNet, is an ensemble of neural networks that stabilizes predictions by averaging.
Achieved an R-squared of 0.5586, the best among the models considered so far, with an RMSE of 0.1145 and MAE of 0.0856, suggesting a strong fit.
Similar to SVM, shows a balanced importance across several predictors, highlighting the model’s use of complex feature interactions.
## decay size bag RMSE Rsquared MAE RMSESD RsquaredSD
## 10 0 10 FALSE 0.1145209 0.5585798 0.0855949 0.01006998 0.06574056
## MAESD
## 10 0.005049342
## RMSE Rsquared MAE
## 0.11073852 0.59788239 0.08266435
Variable importance
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## mnf_flow 0.229134
## usage_cont 0.149547
## bowl_setpoint 0.114516
## filler_level 0.101887
## pressure_setpoint 0.097533
## carb_flow 0.086556
## brand_code.C 0.078800
## hyd_pressure3 0.053540
## pressure_vacuum 0.048704
## fill_pressure 0.043607
## hyd_pressure2 0.034975
## brand_code.D 0.034145
## oxygen_filler 0.032301
## carb_rel 0.024706
## hyd_pressure4 0.024241
## temperature 0.024065
## alch_rel 0.019814
## brand_code.B 0.012828
## psc 0.009141
## fill_ounces 0.008995
Gradient Boosting
Scaling is not required for GBM preparation because the model is based on scale-invariant decision trees. However, centering and scaling are used here to ensure consistency in preprocessing across models, which aids in comparative model evaluation.
GBM combines an ensemble of weak prediction models, often decision trees, into a strong learner through a sequential process in which each tree is designed to remedy the mistakes of earlier trees. The learning process and model complexity are controlled by parameters such as the number of trees, tree depth, and learning rate.
Tuned over a range of tree numbers, interaction depths, and minimum observations in nodes, using a shrinkage factor to regulate learning.
Exhibited the best performance metrics so far, with a cross-validated RMSE of 0.1102, R-squared of 0.5956, and MAE of 0.0830.
mnf_flow and brand_code.C were significant, along with other operational parameters, indicating their pivotal roles in pH level determination.
## shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared
## 110 0.01 7 10 1000 0.1101819 0.5955535
## MAE RMSESD RsquaredSD MAESD
## 110 0.08304377 0.009441458 0.06105948 0.005108199
## RMSE Rsquared MAE
## 0.10548388 0.64448254 0.07928038
Variable importance
## gbm variable importance
##
## only 20 most important variables shown (out of 31)
##
## Overall
## mnf_flow 328.40
## brand_code.C 101.37
## alch_rel 94.79
## oxygen_filler 87.81
## usage_cont 87.32
## temperature 75.56
## carb_pressure1 59.25
## pressure_vacuum 56.63
## carb_rel 55.83
## setpoint_diff 54.23
## balling_lvl 47.54
## air_pressurer 46.17
## carb_flow 45.93
## filler_speed 45.79
## balling 43.74
## hyd_pressure3 37.20
## density 35.93
## mfr 34.59
## hyd_pressure2 25.87
## pc_volume_pressure 25.66
Random Forest
Random Forest models, like GBM, are tree-based and do not require data scaling. The data, however, is scaled and centered to ensure that the preparation methods are consistent across all models used in the research.
During training, Random Forest builds numerous decision trees and outputs the mode of the classes (classification) or the mean prediction (regression) across the individual trees. It is particularly effective for both classification and regression, and its ensemble approach makes it resistant to overfitting.
Delivered the highest test R-squared of 0.7130, along with the lowest RMSE (0.0963) and MAE (0.0699), indicating excellent generalization.
Demonstrates significant contributions from operational variables like mnf_flow and oxygen_filler, aligning with expected impacts on pH.
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 5 10 0.1007928 0.6741332 0.07271081 0.00872876 0.05015008 0.003999654
## RMSE Rsquared MAE
## 0.09630017 0.71303966 0.06989033
Variable importance
## rf variable importance
##
## only 20 most important variables shown (out of 31)
##
## Overall
## mnf_flow 44.54
## oxygen_filler 34.35
## pressure_vacuum 31.16
## carb_rel 29.64
## air_pressurer 29.19
## brand_code.C 28.84
## temperature 28.52
## filler_speed 28.06
## balling_lvl 28.02
## alch_rel 24.77
## usage_cont 24.45
## density 23.91
## balling 23.79
## carb_flow 22.90
## setpoint_diff 21.78
## carb_volume2 21.00
## carb_pressure1 20.62
## hyd_pressure3 18.45
## filler_level 17.95
## mfr 17.05
Resampling metrics (RMSE, R², and MAE) from model training are compared to test metrics to evaluate overfitting and generalization capabilities. Models are judged on their predictive accuracy, robustness, and computational efficiency. The final decision (e.g., Random Forest with the best test metrics) is based on a combination of these criteria, with the goal of producing a model that performs well in a variety of circumstances.
The chosen model should meet operational needs such as ease of deployment, stakeholder interpretability, and long-term maintenance. Model performance should be checked over time using new data to verify that it continues to operate as expected under changing settings.
Based on the resampling (cross-validation) metrics and test metrics for each model, we can perform a detailed analysis to select the best model for predicting pH values.
Random Forest (RF) outperforms the other models on practically all metrics, especially the test metrics, where it has the lowest RMSE (0.0963) and the highest R-squared (0.7130). Its MAE of 0.0699 is likewise the lowest among the models, showing that RF predictions are relatively close to the real values with less error on average.
Gradient Boosting Machine (GBM) follows closely behind RF, with
competitive scores, particularly in test R-squared (0.6445), indicating
its resilience in dealing with the dataset’s variation.
Neural Network (NNet) and Support Vector Machine (SVM) both perform well, with NNet marginally outperforming SVM on both resampling and test metrics, making it a viable alternative if ensemble approaches are eschewed.
While Multivariate Adaptive Regression Splines (MARS) are beneficial for capturing nonlinear interactions, their performance metrics fall behind those of the other models.
Because of their ensemble nature and iterative improvements, RF and GBM models may require careful adjustment to minimize overfitting. This category also includes NNet, which has a layered structure and the possibility for deep interactions. Compared to ensemble and neural network models, MARS and SVM are easier to interpret; if model explanation is important, these may be favored despite inferior performance. RF and GBM also require more training time and resources than simpler models such as MARS or SVM.

Given the findings, Random Forest is suggested for deployment due to its superior performance on all important measures. If there are limits on model transparency or processing resources, simpler models such as SVM or even NNet may be considered as alternatives. Further steps should include a rigorous validation process to confirm the RF model's stability, an investigation into feature engineering based on RF variable relevance, and potential ensemble procedures that could combine strengths from multiple models to improve prediction accuracy.
| Model | Resample RMSE | Resample R² | Resample MAE | Test RMSE | Test R² | Test MAE |
|---|---|---|---|---|---|---|
| MARS | 0.1269369 | 0.4591680 | 0.0960094 | 0.1220811 | 0.5103157 | 0.0921927 |
| SVM | 0.1193689 | 0.5235771 | 0.0877122 | 0.1178720 | 0.5449518 | 0.0844574 |
| NNet | 0.1145209 | 0.5585798 | 0.0855949 | 0.1107385 | 0.5978824 | 0.0826644 |
| GBM | 0.1101819 | 0.5955535 | 0.0830438 | 0.1054839 | 0.6444825 | 0.0792804 |
| RF | 0.1007928 | 0.6741332 | 0.0727108 | 0.0963002 | 0.7130397 | 0.0698903 |
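A complementary comparison could pool the cross-validation folds directly with caret's resamples() (a sketch, assuming the fitted train objects from the appendix):

```r
library(caret)

# Collect per-fold metrics from each fitted model
res <- resamples(list(MARS = marsModel, SVM = svmModel, NNet = nnetModel,
                      GBM = gbm_model, RF = rf_model))
summary(res)                   # RMSE / R-squared / MAE by model
bwplot(res, metric = "RMSE")   # fold-level spread, not just the mean
```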
Save results to Excel file
Structure of the Non-Technical Report
Executive Summary
Objective: Briefly describe the purpose of the analysis, which is to
understand and predict pH levels in the manufacturing process as per new
regulations.
Key Findings: Summarize the main outcomes, such as the most important
factors affecting pH and the performance of the predictive model.
Introduction
Background: Provide context about why pH is a critical quality metric in
beverage production and the impact of regulatory requirements.
Project Scope: Outline what the project covers, including data sources,
main analytical goals, and the expected impact on the production
process.
Data Overview
Data Sources: Describe where the data comes from and what it represents
(production parameters, quality measures, etc.).
Data Quality: Highlight the cleanliness of the data, any challenges with
missing values or outliers, and how these were addressed.
Insights from Data Exploration
Descriptive Statistics: Discuss general trends observed in the data,
such as typical pH levels and variability across different brand
codes.
Visuals and Interpretations: Use simple graphs like bar charts or line
graphs to show important trends and patterns that inform about the pH
level. For instance, show how pH varies by different manufacturing
parameters without getting into statistical jargon.
Predictive Modeling
Approach: Explain that various models were tested to find the best way
to predict pH levels, focusing on the usefulness of the models in making
predictions that help comply with regulations.
Model Selection: Discuss why the chosen model (e.g., Random Forest) was
suitable, focusing on its ability to provide accurate and reliable
predictions rather than the technical details.
Implications and Recommendations
Operational Impact: Discuss how the model can be used in daily
operations to ensure compliance and optimize production processes.
Future Steps: Suggest how the analysis could be extended or improved,
such as collecting more data over time or integrating real-time data
analytics into the production line.
Conclusion
Summary of Benefits: Reiterate the benefits of the predictive modeling
project, emphasizing improved regulatory compliance and enhanced
production efficiency.
Call to Action: Encourage stakeholders to consider the recommendations
and support further steps to integrate analytics into production
practices.
## ----setup, message=FALSE, warning=FALSE, include=FALSE-------------------------------------------------------------------------------------------------------------------
#chunks
knitr::opts_chunk$set(eval=TRUE, message=FALSE, warning=FALSE, echo=FALSE, fig.height=5, fig.align='center')
#libraries
library(tidyverse)
library(DMwR)
library(xgboost)
library(vip)
library(summarytools)
library(fpp3)
library(readxl)
library(curl)
library(latex2exp)
library(seasonal)
library(GGally)
library(gridExtra)
library(doParallel)
library(reshape2)
library(Hmisc)
library(corrplot)
library(e1071)
library(caret)
library(VIM)
library(rpart)
library(forecast)
library(urca)
library(earth)
library(glmnet)
library(cluster)
library(kernlab)
library(aTSA)
library(AppliedPredictiveModeling)
library(mlbench)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(partykit)
library(kableExtra)
library(factoextra)
library(FactoMineR)
library(naniar)
library(mice)
library(janitor)
library(writexl)
## ----load_data------------------------------------------------------------------------------------------------------------------------------------------------------------
#Load train data
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentData.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_beverage <- read_excel(temp_file)
#Copy data
beverage_df <- data_beverage
#Clean temp file
unlink(temp_file)
#Load test data
#URL the raw .xlsx file
url <- "https://raw.githubusercontent.com/ex-pr/project_2_DATA624/main/StudentEvaluation.xlsx"
#Temporary path
temp_file <- tempfile(fileext = ".xlsx")
#Download file
curl_download(url, temp_file)
#Read file
data_eval <- read_excel(temp_file)
#Copy data
eval_df <- data_eval
## ----summary_statistics---------------------------------------------------------------------------------------------------------------------------------------------------
#Check first rows of data
DT::datatable(
beverage_df[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 1: First 10 Rows of Beverage Data'
))
DT::datatable(
eval_df[1:10,],
options = list(scrollX = TRUE,
deferRender = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
info = FALSE,
paging=FALSE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 2: First 10 Rows of Evaluation Data'
))
#Summary statistics
DT::datatable(
dfSummary(beverage_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 3: Summary Statistics for Beverage Data'
))
#Summary statistics
DT::datatable(
dfSummary(eval_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 4: Summary Statistics for Evaluation Data'
))
stat <- beverage_df %>%
group_by(`Brand Code`) %>%
filter(!is.na(`Brand Code`)) %>%
dplyr::summarize(
Min = min(PH, na.rm = TRUE),
Q1 = quantile(PH, 0.25, na.rm = TRUE),
Median = median(PH, na.rm = TRUE),
Mean = mean(PH, na.rm = TRUE),
Q3 = quantile(PH, 0.75, na.rm = TRUE),
Max = max(PH, na.rm = TRUE),
StdDev = sd(PH, na.rm = TRUE),
Count = n(),
Missing = sum(is.na(PH))
)
#Summary statistics by code
DT::datatable(
stat,
options = list(dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE,
paging=FALSE,
info = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 5: Summary Statistics PH for Each Brand Code'
)) %>%
DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)
## ----ph_plots-------------------------------------------------------------------------------------------------------------------------------------------------------------
#Distribution PH plot
ggplot(beverage_df, aes(x = PH)) +
geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Distribution of PH", x = "PH", y = "Frequency")
#Distribution Brand Code plot
ggplot(beverage_df, aes(x = `Brand Code`)) +
geom_bar(fill = "lightgreen") +
theme_minimal() +
labs(title = "Count by Brand Code", x = "Brand Code", y = "Count")
filtered_df <- beverage_df %>%
filter(!is.na(`Brand Code`))
#Boxplot brand code vs ph
ggplot(filtered_df, aes(x = `Brand Code`, y = PH, fill = `Brand Code`)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH")
## ----group_plots----------------------------------------------------------------------------------------------------------------------------------------------------------
group_1 <- c("Carb Pressure", "Carb Pressure1", "Carb Temp", "Temperature",
"Usage cont", "Fill Pressure", "Air Pressurer", "Alch Rel",
"Balling", "Balling Lvl")
group_2 <- c("Carb Volume", "Carb Rel", "Fill Ounces", "Oxygen Filler", "Density",
"PC Volume", "PSC", "PSC CO2",
"PSC Fill", "Pressure Setpoint", "Pressure Vacuum")
group_3 <- c("Mnf Flow", "Carb Flow", "Filler Level", "Filler Speed", "Hyd Pressure1",
"Hyd Pressure2", "Hyd Pressure3", "Hyd Pressure4", "MFR", "Bowl Setpoint")
#Group 1 plot
beverage_df %>%
select(all_of(group_1)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 0.3, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 1")
# Group 2 Plot
beverage_df %>%
select(all_of(group_2)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 0.05, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 2")
# Group 3 Plot
beverage_df %>%
select(all_of(group_3)) %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 1, fill = "lightgreen", color = "black") +
facet_wrap(~name, scales = "free", ncol = 5) +
theme_minimal() +
labs(title = "Distribution of Variables: Group 3")
## ----na_scatterplt_ph_vs_features-----------------------------------------------------------------------------------------------------------------------------------------
#NA
gg_miss_var(beverage_df, show_pct = TRUE)
#Choose numeric vars
numeric_vars <- beverage_df %>%
select(where(is.numeric)) %>%
names()
#Create scatter plots vs PH
for (var in numeric_vars) {
p <- ggplot(beverage_df, aes_string(x = paste0("`", var, "`"), y = "PH")) +
geom_point(alpha = 0.6, color = "blue") +
theme_minimal() +
labs(title = paste("Scatterplot:", var, "vs PH"), x = var, y = "PH")
print(p) # Display the plot
}
## ----corr-----------------------------------------------------------------------------------------------------------------------------------------------------------------
tst <- beverage_df %>%
select(where(is.numeric))
#Correlation with PH
cor_table <- cor(drop_na(tst))[, "PH"] %>%
as.data.frame() %>%
rownames_to_column(var = "Variable") %>%
rename(Coefficient = ".") %>%
arrange(desc(Coefficient))
kable(cor_table, "html", escape = F, col.names = c('Variable', 'Coefficient')) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
add_header_above(c("Table 6: Correlation with PH" = 2))
#Corr matrix
rcore <- rcorr(as.matrix(tst))
coeff <- rcore$r
#Filter to include only |r| > 0.1, the rest are 0
filtered_coeff <- coeff
filtered_coeff[abs(filtered_coeff) < 0.1] <- 0
corrplot(filtered_coeff, tl.cex = .6, tl.col="black", method = 'color', addCoef.col = "black",
type="upper", order="hclust", number.cex=0.45, diag=FALSE)
## ----colnames_zeroVar-----------------------------------------------------------------------------------------------------------------------------------------------------
#Fix column names
colnames(beverage_df) <- make_clean_names(colnames(beverage_df))
#Apply the same to eval_df
colnames(eval_df) <- make_clean_names(colnames(eval_df))
#Check new column names
colnames(beverage_df)
# Identify zero-variance predictors
zero_var <- nearZeroVar(beverage_df, saveMetrics = TRUE)
print(zero_var[zero_var$nzv, ])
beverage_df <- beverage_df[, !zero_var$nzv]
eval_df <- eval_df[, !zero_var$nzv]
## ----impute_numeric-------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(547)
#KNN imputation
beverage_imputed <- as.data.frame(beverage_df)
beverage_imputed <- knnImputation(beverage_imputed[, !names(beverage_imputed) %in% c("brand_code", "mnf_flow")])
beverage_imputed$brand_code <- beverage_df$brand_code
beverage_imputed$mnf_flow <- beverage_df$mnf_flow
#KNN imputation mnf_flow
beverage_imputed$mnf_flow[beverage_imputed$mnf_flow == -100] <- NA
beverage_imputed <- knnImputation(beverage_imputed[, !names(beverage_imputed) %in% c("brand_code")], k = 5, scale = TRUE, meth = 'median')
beverage_imputed$brand_code <- beverage_df$brand_code
eval_imputed <- as.data.frame(eval_df)
eval_imputed <- knnImputation(eval_imputed[, !names(eval_imputed) %in% c("brand_code", "ph")])
eval_imputed$brand_code <- eval_df$brand_code
eval_imputed$ph <- eval_df$ph
#eval_df doesnt have na in mnf_flow
colSums(is.na(beverage_imputed))
colSums(is.na(eval_imputed))
## ----brand_Code_impute----------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(547)
impute_rf <- function(data, target_var, exclude_vars = NULL) {
#Ensure target_var is a factor
data[[target_var]] <- factor(data[[target_var]])
#Exclude specified variables from the model input
if (!is.null(exclude_vars)) {
model_data <- data[, !(names(data) %in% exclude_vars)]
} else {
model_data <- data
}
#Train model on non-NA data for the target variable
rf_model <- randomForest(reformulate(".", target_var), data = model_data[!is.na(model_data[[target_var]]),], na.action = na.omit)
#Predict NAs
na_indices <- is.na(data[[target_var]])
if (sum(na_indices) > 0) {
predicted_values <- predict(rf_model, newdata = model_data[na_indices,])
data[[target_var]][na_indices] <- predicted_values
}
return(data)
}
beverage_imputed <- impute_rf(beverage_imputed, "brand_code")
eval_imputed <- impute_rf(eval_imputed, "brand_code", exclude_vars = "ph")
#Verify
colSums(is.na(beverage_imputed))
colSums(is.na(eval_imputed))
## ----encode_categorical---------------------------------------------------------------------------------------------------------------------------------------------------
#One-hot Encoding for Brand Code
beverage_imputed$brand_code <- as.factor(beverage_imputed$brand_code)
eval_imputed$brand_code <- as.factor(eval_imputed$brand_code)
#Create dummies
dummy_vars <- dummyVars(~ brand_code, data = beverage_imputed, fullRank = TRUE)
beverage_imputed <- cbind(beverage_imputed, predict(dummy_vars, newdata = beverage_imputed))
beverage_imputed <- beverage_imputed %>% select(-brand_code)
eval_imputed <- cbind(eval_imputed, predict(dummy_vars, newdata = eval_imputed))
eval_imputed <- eval_imputed %>% select(-brand_code)
## ----new_features---------------------------------------------------------------------------------------------------------------------------------------------------------
#Creating Interaction Terms
beverage_revised <- beverage_imputed %>%
select(-contains('_interaction'), -contains('composite'), -contains('efficiency'), -contains('index')) %>%
mutate(
setpoint_diff = bowl_setpoint - pressure_setpoint, # Example of feature engineering
carb_volume2 = carb_volume^2, # Polynomial feature for carb_volume
pc_volume_pressure = pc_volume * carb_pressure # Interaction feature
)
eval_revised <- eval_imputed %>%
select(-contains('_interaction'), -contains('composite'), -contains('efficiency'), -contains('index')) %>%
mutate(
setpoint_diff = bowl_setpoint - pressure_setpoint, # Example of feature engineering
carb_volume2 = carb_volume^2, # Polynomial feature for carb_volume
pc_volume_pressure = pc_volume * carb_pressure # Interaction feature
)
#Remove original columns to reduce collinearity
beverage_revised <- beverage_revised %>%
select(-c(bowl_setpoint, pressure_setpoint, carb_volume, pc_volume, carb_pressure))
eval_revised <- eval_revised %>%
select(-c(bowl_setpoint, pressure_setpoint, carb_volume, pc_volume, carb_pressure))
# Creating a binned feature for another variable, e.g., carb_temp
beverage_revised$carb_temp_binned <- cut(beverage_revised$carb_temp,
breaks=quantile(beverage_revised$carb_temp, probs=seq(0, 1, 0.25)),
include.lowest=TRUE,
labels=FALSE)
eval_revised$carb_temp_binned <- cut(eval_revised$carb_temp,
breaks=quantile(eval_revised$carb_temp, probs=seq(0, 1, 0.25)),
include.lowest=TRUE,
labels=FALSE)
# Optionally, remove original carb_temp if the binned version is preferred
beverage_revised <- beverage_revised %>%
select(-carb_temp)
eval_revised <- eval_revised %>%
select(-carb_temp)
## ----transform_vars-------------------------------------------------------------------------------------------------------------------------------------------------------
# Apply transformations
set.seed(123)
# Log transformations
beverage_transformed <- beverage_imputed
beverage_transformed$filler_speed <- log(beverage_transformed$filler_speed)
beverage_transformed$mfr <- log(beverage_transformed$mfr)
# Square root transformations
beverage_transformed$temperature <- sqrt(beverage_transformed$temperature)
beverage_transformed$air_pressurer <- sqrt(beverage_transformed$air_pressurer)
# Shifted log transformations
beverage_transformed$oxygen_filler <- log(beverage_transformed$oxygen_filler + 1)
beverage_transformed$psc_co2 <- log(beverage_transformed$psc_co2 + 1)
# Log transformations
eval_transformed <- eval_imputed
eval_transformed$filler_speed <- log(eval_imputed$filler_speed)
eval_transformed$mfr <- log(eval_imputed$mfr)
# Square root transformations
eval_transformed$temperature <- sqrt(eval_transformed$temperature)
eval_transformed$air_pressurer <- sqrt(eval_transformed$air_pressurer)
# Shifted log transformations
eval_transformed$oxygen_filler <- log(eval_transformed$oxygen_filler + 1)
eval_transformed$psc_co2 <- log(eval_transformed$psc_co2 + 1)
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Summary statistics
DT::datatable(
dfSummary(beverage_revised, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 7: Summary Statistics for Transformed Beverage Data'
))
#Summary statistics
DT::datatable(
dfSummary(eval_revised, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE),
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 8: Summary Statistics for Transformed Evaluation Data'
))
## ----parallel_core--------------------------------------------------------------------------------------------------------------------------------------------------------
#Detect the number of available cores
numCores <- detectCores()
#Register parallel backend (use numCores - 1 to leave 1 core free for system tasks)
cl <- makeCluster(numCores - 1)
registerDoParallel(cl)
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Data with new features
set.seed(123)
trainIndex <- createDataPartition(beverage_revised$ph, p = 0.8, list = FALSE)
train_revised <- beverage_revised[trainIndex, ]
test_revised <- beverage_revised[-trainIndex, ]
eval_revised <- subset(eval_revised, select = -ph)
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
trainIndex <- createDataPartition(beverage_transformed$ph, p = 0.8, list = FALSE)
train_transformed <- beverage_transformed[trainIndex, ]
test_transformed <- beverage_transformed[-trainIndex, ]
eval_transformed <- subset(eval_transformed, select = -ph)
#Setup cv
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, allowParallel = TRUE)
## ----mars_model-----------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
#Train MARS
marsModel <- train(ph ~ ., data = train_transformed,
method = "earth",
preProc = c("center", "scale"),
tuneGrid = marsGrid,
trControl = control
)
#degree nprune RMSE Rsquared MAE
# 2 20 0.1269369 0.459168 0.0960094
mars_tune <- marsModel$results[marsModel$results$nprune == marsModel$bestTune$nprune & marsModel$results$degree == marsModel$bestTune$degree, ]
mars_tune
#Predict
marsPred <- predict(marsModel, newdata = test_transformed)
#RMSE Rsquared MAE
#0.12208106 0.51031566 0.09219265
mars_results <- postResample(pred = marsPred, obs = test_transformed$ph)
mars_results
## ----mars_varimpt---------------------------------------------------------------------------------------------------------------------------------------------------------
importance_mars <- varImp(marsModel, scale = FALSE)
importance_mars
plot(importance_mars, top = 10, main = "Top 10 predictors, MARS Model")
## ----svm_model------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
svmGrid <- expand.grid(sigma = c(0.001, 0.01, 0.1), C = c(0.1, 1, 10, 100))
#Train SVM
svmModel <- train(ph ~ ., data = train_transformed,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmGrid,
trControl = control
)
#sigma C RMSE Rsquared MAE
# 0.01 10 0.1193689 0.5235771 0.08771225
svm_tune <- svmModel$results[svmModel$results$sigma == svmModel$bestTune$sigma & svmModel$results$C == svmModel$bestTune$C, ]
svm_tune
#Predict
svmPred <- predict(svmModel, newdata = test_transformed)
#RMSE Rsquared MAE
#0.11787203 0.54495184 0.08445738
svm_results <- postResample(pred = svmPred, obs = test_transformed$ph)
svm_results
## ----svm_varimpt----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_svm <- varImp(svmModel, scale = FALSE)
importance_svm
plot(importance_svm, top = 10, main = "Top 10 predictors, SVM Model")
## ----nnet_model-----------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10, .bag = FALSE)
#Train avNNet
nnetModel <- train(ph ~ ., data = train_transformed,
method = "avNNet",
preProc = c("center", "scale"),
tuneGrid = nnetGrid,
trControl = control,
linout = TRUE,
trace = FALSE)
#decay size bag RMSE Rsquared MAE
# 0 10 FALSE 0.1145209 0.5585798 0.0855949
nnet_tune <- nnetModel$results[nnetModel$results$size == nnetModel$bestTune$size & nnetModel$results$decay == nnetModel$bestTune$decay, ]
nnet_tune
#Predict
nnetPred <- predict(nnetModel, newdata = test_transformed)
#RMSE Rsquared MAE
#0.11073852 0.59788239 0.08266435
nnet_results <- postResample(pred = nnetPred, obs = test_transformed$ph)
nnet_results
## ----nnet_varimpt---------------------------------------------------------------------------------------------------------------------------------------------------------
importance_nnet <- varImp(nnetModel, scale = FALSE)
importance_nnet
plot(importance_nnet, top = 10, main = "Top 10 predictors, avNNet Model")
## ----gbm_model------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
gbmGrid <- expand.grid(.n.trees = seq(100, 1000, by = 100), .interaction.depth = seq(1, 7, by = 2), .shrinkage = 0.01, .n.minobsinnode = c(5, 10, 15))
#Train GBM
gbm_model <- train(ph ~ ., data = train_revised,
method = "gbm",
preProc = c("center", "scale"),
tuneGrid = gbmGrid,
trControl = control,
verbose = FALSE)
#shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared MAE
# 0.01 7 10 1000 0.1101819 0.5955535 0.08304377
gbm_tune <- gbm_model$results[gbm_model$results$n.trees == gbm_model$bestTune$n.trees & gbm_model$results$interaction.depth == gbm_model$bestTune$interaction.depth & gbm_model$results$n.minobsinnode == gbm_model$bestTune$n.minobsinnode,]
gbm_tune
#Predict
gbm_predictions <- predict(gbm_model, newdata = test_revised)
# RMSE Rsquared MAE
#0.10548388 0.64448254 0.07928038
gbm_results <- postResample(pred = gbm_predictions, obs = test_revised$ph)
gbm_results
## ----gbm_varimpt----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_gbm <- varImp(gbm_model, scale = FALSE)
importance_gbm
plot(importance_gbm, top = 10, main = "Top 10 predictors, GBM")
## ----rf_model-------------------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(123)
#Grid with parameters
rfGrid <- expand.grid(.mtry = c(2, 4, 6, 8, 10))
#Train rf
rf_model <- train(ph ~ ., data = train_revised,
method = "rf",
preProc = c("center", "scale"),
trControl = control,
tuneGrid = rfGrid,
importance = TRUE)
#mtry RMSE Rsquared MAE
# 10 0.1007928 0.6741332 0.07271081
rf_tune <- rf_model$results[rf_model$results$mtry == rf_model$bestTune$mtry, ]
rf_tune
#Predict
rf_pred <- predict(rf_model, newdata = test_revised)
#RMSE Rsquared MAE
#0.09630017 0.71303966 0.06989033
rf_results <- postResample(pred = rf_pred, obs = test_revised$ph)
rf_results
## ----rf_variimpt----------------------------------------------------------------------------------------------------------------------------------------------------------
importance_rf <- varImp(rf_model, scale = FALSE)
importance_rf
plot(importance_rf, top = 10, main = "Top 10 predictors, Random Forest")
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Stop parallel processing once the models are done
stopCluster(cl)
## ----compare_models-------------------------------------------------------------------------------------------------------------------------------------------------------
#Create empty df
results <- data.frame(
Model = character(),
Resample_RMSE = numeric(),
Resample_R2 = numeric(),
Resample_MAE = numeric(),
Test_RMSE = numeric(),
Test_R2 = numeric(),
Test_MAE = numeric(),
stringsAsFactors = FALSE
)
#Fill df with results
results <- rbind(results, data.frame(Model = "MARS", Resample_RMSE = mars_tune$RMSE, Resample_R2 = mars_tune$Rsquared, Resample_MAE = mars_tune$MAE, Test_RMSE = mars_results[1], Test_R2 = mars_results[2], Test_MAE = mars_results[3]))
results <- rbind(results, data.frame(Model = "SVM", Resample_RMSE = svm_tune$RMSE, Resample_R2 = svm_tune$Rsquared, Resample_MAE = svm_tune$MAE, Test_RMSE = svm_results[1], Test_R2 = svm_results[2], Test_MAE = svm_results[3]))
results <- rbind(results, data.frame(Model = "NNet", Resample_RMSE = nnet_tune$RMSE, Resample_R2 = nnet_tune$Rsquared, Resample_MAE = nnet_tune$MAE, Test_RMSE = nnet_results[1], Test_R2 = nnet_results[2], Test_MAE = nnet_results[3]))
results <- rbind(results, data.frame(Model = "GBM", Resample_RMSE = gbm_tune$RMSE, Resample_R2 = gbm_tune$Rsquared, Resample_MAE = gbm_tune$MAE, Test_RMSE = gbm_results[1], Test_R2 = gbm_results[2], Test_MAE = gbm_results[3]))
results <- rbind(results, data.frame(Model = "RF", Resample_RMSE = rf_tune$RMSE, Resample_R2 = rf_tune$Rsquared, Resample_MAE = rf_tune$MAE, Test_RMSE = rf_results[1], Test_R2 = rf_results[2], Test_MAE = rf_results[3]))
row.names(results) <- NULL
kable(results, "html", escape = FALSE, align = c('l', rep('c', 6)), col.names = c("Model", "Resample RMSE", "Resample R²", "Resample MAE", "Test RMSE", "Test R²", "Test MAE")) %>%
kable_styling("striped", full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
add_header_above(c(" " = 1, "Resample Metrics" = 3, "Test Metrics" = 3),
bold = TRUE, background = "#D3D3D3", color = "black")
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Predictions for the evaluation dataset
eval_pred <- predict(rf_model, newdata = eval_revised)
#Create df
eval_results <- data.frame(
ph_pred = eval_pred
)
eval_revised_pred <- eval_revised
eval_revised_pred$ph_pred <- eval_pred
#Save to Excel
write_xlsx(eval_revised_pred, "eval_revised_pred.xlsx")
write_xlsx(eval_results, "ph_predictions.xlsx")
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#average predicted ph per brand
#brand_code was replaced by dummy columns earlier, so rebuild the label
eval_revised_pred <- eval_revised_pred %>%
  mutate(brand_code = case_when(
    brand_code.B == 1 ~ "B",
    brand_code.C == 1 ~ "C",
    brand_code.D == 1 ~ "D",
    TRUE ~ "A"
  ))
eval_revised_pred %>%
  group_by(brand_code) %>%
  summarise(`Average PH` = mean(ph_pred))
stat <- eval_revised_pred %>%
  group_by(brand_code) %>%
  dplyr::summarize(
    Min = min(ph_pred, na.rm = TRUE),
    Q1 = quantile(ph_pred, 0.25, na.rm = TRUE),
    Median = median(ph_pred, na.rm = TRUE),
    Mean = mean(ph_pred, na.rm = TRUE),
    Q3 = quantile(ph_pred, 0.75, na.rm = TRUE),
    Max = max(ph_pred, na.rm = TRUE),
    StdDev = sd(ph_pred, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(ph_pred))
  )
#Summary statistics by code
DT::datatable(
stat,
options = list(dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE,
paging=FALSE,
info = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Summary Statistics of Predicted PH for Each Brand Code'
)) %>%
DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)
#Summary statistics PH by brand for the training data
#(column names were cleaned earlier, so use brand_code / ph)
stat <- beverage_df %>%
  group_by(brand_code) %>%
  filter(!is.na(brand_code)) %>%
  dplyr::summarize(
    Min = min(ph, na.rm = TRUE),
    Q1 = quantile(ph, 0.25, na.rm = TRUE),
    Median = median(ph, na.rm = TRUE),
    Mean = mean(ph, na.rm = TRUE),
    Q3 = quantile(ph, 0.75, na.rm = TRUE),
    Max = max(ph, na.rm = TRUE),
    StdDev = sd(ph, na.rm = TRUE),
    Count = n(),
    Missing = sum(is.na(ph))
  )
#Summary statistics by code
DT::datatable(
stat,
options = list(dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE,
paging=FALSE,
info = FALSE),
rownames = FALSE,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 16px; font-weight: bold;',
'Table 5: Summary Statistics PH for Each Brand Code'
)) %>%
DT::formatRound(columns = c("Min", "Q1", "Median", "Mean", "Q3", "Max", "StdDev"), digits = 3)
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
library(knitr)
purl("daria_draft.Rmd", output = "extracted_code.R")