Predicting the PH Of Beverages

Introduction:

As a data scientist tasked with developing a predictive model for pH regulation in manufacturing processes, the overarching objective is to leverage data-driven insights to enhance operational efficiency, ensure regulatory compliance, and optimize product quality. By harnessing advanced analytics and machine learning techniques, we aim to empower decision-makers with actionable insights that enable proactive management of pH fluctuations throughout the production cycle.

Goal:

The goal of this project is to identify relevant features and engineer variables that capture the predictive factors influencing pH variations. Utilize statistical methods and domain expertise to select the most informative features for model development. We will also try to develop and train predictive models using appropriate algorithms such as regression, time series analysis, or machine learning classifiers. Employ rigorous evaluation techniques, including cross-validation and performance metrics, to assess model accuracy and generalization ability. Our goal is also to ensure transparency and interpret-ability of the predictive model by employing techniques such as feature importance analysis, model visualization, and explanation methods to elucidate the factors driving pH predictions. We will also suggest how to integrate the predictive model with existing decision support systems or manufacturing control systems to enable automated decision-making based on pH forecasts. Facilitate seamless communication between the data science team and operational stakeholders.

By pursuing these goals, we aim to establish a robust data science framework for pH predictive modeling that not only meets regulatory obligations but also drives innovation, efficiency, and excellence in manufacturing operations.

Data Acquisition and Exploration:

In this phase we will collect and integrate diverse data sets encompassing raw materials, process parameters, environmental conditions, and historical pH measurements to build a comprehensive understanding of the manufacturing process. In our case we were provided with a data set but obviously the data set provided was not ready to be fed to an algorithm to train a model. The data set provided was in .xlsx format and we stored on remote location i.e. Github repository, for reproducibility in a comma separated values (.csv) format. The below code chunk loads the provided data set into our environment where we can preprocess and make the data ready for the algorithms.

Loading The Dataset:

df <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/Using-Predictive-analytics-to-predict-PH-of-beverages/main/StudentData%20-%20Copy.csv")

Data Exploration:

Let’s check out the first few rows of the data set to confirm that all the features and observations are loaded correctly.

flextable(head(df))

Brand Code	Carb Volume	Fill Ounces	PC Volume	Carb Pressure	Carb Temp	PSC	PSC Fill	PSC CO2	Mnf Flow	Carb Pressure1	Fill Pressure	Hyd Pressure2	Hyd Pressure3	Hyd Pressure4	Filler Level	Filler Speed	Temperature	Usage cont	Carb Flow	Density	MFR	Balling	Pressure Vacuum	PH	Oxygen Filler	Bowl Setpoint	Pressure Setpoint	Air Pressurer	Alch Rel	Carb Rel	Balling Lvl
B	5.340000	23.96667	0.2633333	68.2	141.2	0.104	0.26	0.04	-100	118.8	46.0			118	121.2	4,002	66.0	16.18	2,932	0.88	725.0	1.398	-4.0	8.36	0.022	120	46.4	142.6	6.58	5.32	1.48
A	5.426667	24.00667	0.2386667	68.4	139.6	0.124	0.22	0.04	-100	121.6	46.0			106	118.6	3,986	67.6	19.90	3,144	0.92	726.8	1.498	-4.0	8.26	0.026	120	46.8	143.0	6.56	5.30	1.56
B	5.286667	24.06000	0.2633333	70.8	144.8	0.090	0.34	0.16	-100	120.2	46.0			82	120.0	4,020	67.0	17.76	2,914	1.58	735.0	3.142	-3.8	8.94	0.024	120	46.6	142.0	7.66	5.84	3.28
A	5.440000	24.00667	0.2933333	63.0	132.6		0.42	0.04	-100	115.2	46.4	0	0	92	117.8	4,012	65.6	17.42	3,062	1.54	730.6	3.042	-4.4	8.24	0.030	120	46.0	146.2	7.14	5.42	3.04
A	5.486667	24.31333	0.1113333	67.2	136.8	0.026	0.16	0.12	-100	118.4	45.8	0	0	92	118.6	4,010	65.6	17.68	3,054	1.54	722.8	3.042	-4.4	8.26	0.030	120	46.0	146.2	7.14	5.44	3.04
A	5.380000	23.92667	0.2693333	66.6	138.4	0.090	0.24	0.04	-100	119.6	45.6	0	0	116	120.2	4,014	66.2	23.82	2,948	1.52	738.8	2.992	-4.4	8.32	0.024	120	46.0	146.6	7.16	5.44	3.02

We can see that everything loaded correctly let’s check the structure of the data set to confirm that data types of the features.

glimpse(df)

## Rows: 2,571
## Columns: 33
## $ `Brand Code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…

We can see that most of the data did have the right format apart from the feature Brand Code. which needs to be fixed. Everything else looks good so far when it comes to data type but we can observe that some of the data has missing values when it comes to few variables or features like MFR,Brand Code, and Filler Speed e.t.c

Descriptive Summary:

summary(df)

##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

Visual Analytics of Features:

Since we have a lot of numeric predictors so let’s plot the histogram to see how they are distributed.

df %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")

As we can see that predictors like Carb Pressure, Carb Temp, Carb Volume, Pressure Vaccum, PC Volume, are Fill Ounces are almost normally distributed and so does our target variable PH. Since our aim is to predict the PH which is a continuous number so we are dealing with a regression problem which in turn mean that we can apply linear regression. Application of linear regression should give us good results but we can see that most of the predictors that are not linearly distributed and it can be transformed to get good results but we might not spend that much time on each predictor if the linear regression model required a lot of improvement. At that point we might use other regression techniques like multivariate adaptive regression splines (MARS), Support vector machines (SVM), Boosted Trees, Random Forest and Neural Networks e.t.c.

Let’s check out the Brand Code column now but we need to fix the data type before we plot the column.

df$`Brand Code` <- as.factor(df$`Brand Code`)
df %>%
  ggplot() + 
  geom_bar(aes(x = `Brand Code`)) + 
  ggtitle("Distribution of the Brand Codes")

we can see that we have four different brand codes i.e. A,B,C and D. A lot of observations are from brand code B. We can also see that we have a bunch of missing values in our brand codes. In the next section we will take care of the missing values in our Data set.

Pre-processing Dataset:

In previous section we saw that we had a lot missing values in certain columns and in this section we will use different techniques to impute those missing values. Before imputation let’s check out the number of missing values in each column.

Checking Missing Values:

df %>%
  summarise_all(list(~ sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0) %>%
  arrange(-value)

## # A tibble: 31 × 2
##    variable       value
##    <chr>          <int>
##  1 MFR              212
##  2 Brand Code       120
##  3 Filler Speed      57
##  4 PC Volume         39
##  5 PSC CO2           39
##  6 Fill Ounces       38
##  7 PSC               33
##  8 Carb Pressure1    32
##  9 Hyd Pressure4     30
## 10 Carb Pressure     27
## # ℹ 21 more rows

We can see that our MFR column has the highest number of missing values which almost equal to 8.24% of the total observations. In general it preferred by a lot of data scientist to remove columns that have missing more than 5% of the total observations. Brand Code also has missing values of around 4.6% of total observations. Now in order to impute these missing values we would use mice package from R but before that we have to fix the column names and remove the spacing in between the names since it is not compatible with mice package functions.

Fixing Column Names:

In this section we will create a list to hold the column names and then use that list to replace all the column names in the data frame.

names<- c("Brand_Code", "Carb_Volume", "Fill_Ounces","PC_Volume", "Carb_Pressure", "Carb_Temp","PSC" ,  "PSC_Fill", "PSC_CO2", "Mnf_Flow",  "Carb_Pressure1", "Fill_Pressure", "Hyd_Pressure1", "Hyd_Pressure2", "Hyd_Pressure3", "Hyd_Pressure4", "Filler_Level", "Filler_Speed" , "Temperature", "Usage_cont", "Carb_Flow", "Density", "MFR",  "Balling", "Pressure_Vacuum", "PH", "Oxygen_Filler", "Bowl_Setpoint",  
 "Pressure_Setpoint", "Air_Pressurer", "Alch_Rel" , "Carb_Rel", "Balling_Lvl")

colnames(df) <- names

Now we replace the column names and we can go ahead the check the first rows of the columns

flextable(head(df))

Brand_Code	Carb_Volume	Fill_Ounces	PC_Volume	Carb_Pressure	Carb_Temp	PSC	PSC_Fill	PSC_CO2	Mnf_Flow	Carb_Pressure1	Fill_Pressure	Hyd_Pressure2	Hyd_Pressure3	Hyd_Pressure4	Filler_Level	Filler_Speed	Temperature	Usage_cont	Carb_Flow	Density	MFR	Balling	Pressure_Vacuum	PH	Oxygen_Filler	Bowl_Setpoint	Pressure_Setpoint	Air_Pressurer	Alch_Rel	Carb_Rel	Balling_Lvl
B	5.340000	23.96667	0.2633333	68.2	141.2	0.104	0.26	0.04	-100	118.8	46.0			118	121.2	4,002	66.0	16.18	2,932	0.88	725.0	1.398	-4.0	8.36	0.022	120	46.4	142.6	6.58	5.32	1.48
A	5.426667	24.00667	0.2386667	68.4	139.6	0.124	0.22	0.04	-100	121.6	46.0			106	118.6	3,986	67.6	19.90	3,144	0.92	726.8	1.498	-4.0	8.26	0.026	120	46.8	143.0	6.56	5.30	1.56
B	5.286667	24.06000	0.2633333	70.8	144.8	0.090	0.34	0.16	-100	120.2	46.0			82	120.0	4,020	67.0	17.76	2,914	1.58	735.0	3.142	-3.8	8.94	0.024	120	46.6	142.0	7.66	5.84	3.28
A	5.440000	24.00667	0.2933333	63.0	132.6		0.42	0.04	-100	115.2	46.4	0	0	92	117.8	4,012	65.6	17.42	3,062	1.54	730.6	3.042	-4.4	8.24	0.030	120	46.0	146.2	7.14	5.42	3.04
A	5.486667	24.31333	0.1113333	67.2	136.8	0.026	0.16	0.12	-100	118.4	45.8	0	0	92	118.6	4,010	65.6	17.68	3,054	1.54	722.8	3.042	-4.4	8.26	0.030	120	46.0	146.2	7.14	5.44	3.04
A	5.380000	23.92667	0.2693333	66.6	138.4	0.090	0.24	0.04	-100	119.6	45.6	0	0	116	120.2	4,014	66.2	23.82	2,948	1.52	738.8	2.992	-4.4	8.32	0.024	120	46.0	146.6	7.16	5.44	3.02

We can see that the column names have been updated.

Imputing Missing values:

Now that our data set is ready for the imputation we can go ahead and use mice package with Random Forest (RF). RF is particularly advantageous due to its capability to handle complex relationships, including non-linearities and interactions among variables. This robustness makes it suitable for data sets with mixed data types, as RF can seamlessly manage both continuous and categorical variables without requiring pre-processing. By employing an ensemble of decision trees, RF reduces bias compared to simpler imputation methods like mean or median imputation, yielding more accurate estimates for missing values. Moreover, RF provides insights into variable importance, aiding in variable selection and enhancing interpretability. Integrated with built-in cross-validation in the mice package, RF imputation optimizes model hyper-parameters, ensuring generalizability to unseen data. Overall, RF imputation in mice offers a versatile and effective approach to handling missing data, contributing to more robust statistical analyses and modeling.

set.seed(100)

df <- mice(df, m = 1, method = 'rf', print = FALSE) %>% complete()

After taking care of the missing values we will then remove the columns that have low to almost zero variance since it might not help us in the predictions. Removing zero variance columns from a dataset before modeling serves several important purposes. First and foremost, these columns, which contain no variability as all values are the same, offer no discriminatory power to the model, essentially conveying no useful information for classification or prediction tasks. By eliminating them, the efficiency of the modeling process is enhanced, as computational resources are not wasted on training with non-informative features. Additionally, excluding zero variance columns helps avoid potential issues with certain algorithms that may encounter numerical instabilities or errors when confronted with such features. Furthermore, simplifying the model’s interpretation becomes more feasible when irrelevant features are removed, allowing for a clearer understanding of the relationship between predictors and outcomes. This process also aids in reducing redundancy and multi-collinearity within the dataset, thereby improving the stability and performance of the model.

# filtering low frequencies
df <- df[, -nearZeroVar(df)]

In this case, only Hyd_Pressure1 was removed because of its low variance. Now that our data set is ready we can go ahead check the correlation among different columns and with our target column PH

Features Selection:

Correlation Among Features:

Since it is a regression problem so we will need to check the colinearity among the variables and we might use Linear regression too and for linear regression co-linearity and multi-colinearity does affect the performance of the model a lot. We can drop some weak predictors or predictors that highly corelated with other predictors too.

# Option 1: Removing non-numeric columns before calculating correlation
numeric_df <- df %>%
  select_if(is.numeric)

# Calculate correlation matrix and plot
correlation_matrix <- cor(numeric_df)
corrplot(correlation_matrix, title = "Correlation between Variables")

We can see that PH does not have a strong positive correlation with a lot of features. It has a moderate correlation with features like Filler_Level, Pressure_Vaccum, Oxygen_Filler and Bowl_Setpoint. We can also observe multi-colinearity among Density,Balling, Alch_Rel,Hyd_Pressure3 and Baling_Lvl e.t.c. Which can cause problems if we using a linear regression since multi-colinearity affects. We can counter that by carrying out principle component analysis (PCA). PCA will reduce the dimension and get rid of multi-colinearity but even then we would need other predictor to strongly correlate for linear regression. If we instead start to drop the column there is higher chance of getting a very low \(R^2\) which means that the variance in the data might not get captured by the linear regression. So we will try not remove any columns and in combination with linear regression we will try other models like Trees and regulation rules, Support vector machines, and Neural networks e.t.c. These algorithms are not limited by multi-colinearity. We might not get a good performance from neural network either since it requires a lot of data points and for this particular study we only have 2571 observations which be further cut down into training and testing data sets.

Model Development And Evaluation:

In this section we will develop our models. First we will split the data into 20-80. 20% for testing the algorithm’s performance while 80% for training the algorithm. Once we are done with the split we can go ahead and train our models. For this study we will use linear regression, robust linear regression, multivariate adaptive regression splines, support vector machines, decision tree, random forests, Cubist, boosting and neural networks

Splitting The Data:

set.seed(123)
index <- createDataPartition(df$PH, p = .8, list = FALSE)

# train 
train_x <- df[index, ] |> select(-PH)
train_y <- df[index, 'PH']

# test
test_x <- df[-index, ] |> select(-PH)
test_y <- df[-index, 'PH']

Linear Regression

ctrl <- trainControl(method = "cv", number = 10)
suppressMessages({
set.seed(123)
lmtuned <- train(train_x,train_y,  method = 'lm', preProcess = c("center","scale"), trControl = ctrl)

lmpred <- predict(lmtuned, test_x)
})
postResample(lmpred,test_y)

##      RMSE  Rsquared       MAE 
## 0.1305819 0.4284191 0.1008029

Robust Linear Regression

set.seed(123)
rlmPCA <- train(train_x, train_y, method = "rlm", preProcess = c("center","scale"), trControl = ctrl)

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

rlmpred <- predict(rlmPCA, test_x)

postResample(rlmpred, test_y)

##      RMSE  Rsquared       MAE 
## 0.1306904 0.4285156 0.1005828

Boosted Trees

set.seed(123)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)


gbmTune <- train(train_x,  train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 preProcess = c("center", "scale"),
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)

##       RMSE   Rsquared        MAE 
## 0.10487097 0.63122799 0.07756929

Random Forest

postResample(rfPred, test_y)

##       RMSE   Rsquared        MAE 
## 0.09381957 0.71779105 0.06765723

MARS

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

# tune
marsTune <- train(train_x,train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)

##       RMSE   Rsquared        MAE 
## 0.11711640 0.53950964 0.08767137

Cubist

set.seed(123)
cubistTuned <- train(train_x, train_y,
                     preProcess = c("center", "scale"), 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)

##       RMSE   Rsquared        MAE 
## 0.09757177 0.68114666 0.06732976

Neural Network

postResample(nnPred, test_y)

##      RMSE  Rsquared       MAE 
## 0.1252349 0.4770767 0.0919553

Support Vector Machines

set.seed(123)

# tune
svmRTune <- train(train_x[, -1], train_y,
                  method = "svmRadial",
                  tuneLength = 14,
                  preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv"))

svmRPred <- predict(svmRTune, test_x[, -1])

postResample(svmRPred, test_y)

##       RMSE   Rsquared        MAE 
## 0.11733152 0.53935746 0.08531385

Evaluation:

Model Selection:

In regression analysis, evaluating the performance of models is crucial for understanding how well they capture the relationship between predictor variables and the target variable. Commonly used metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) provide insights into the accuracy of predictions, while R-squared (R²) measures the proportion of variance explained by the model. Residual analysis and cross-validation techniques further aid in assessing model assumptions and generalization to new data. By considering these evaluation metrics and techniques, analysts can make informed decisions about model selection, refinement, and interpretation, ultimately ensuring the reliability and validity of regression models in capturing underlying patterns in the data.

rbind(LinearRegression = postResample(lmpred, test_y),
      RobustLinearRegression = postResample(rlmpred, test_y),
      NeuralNetwork = postResample(nnPred, test_y),
      MARS = postResample(marsPred, test_y),
      SVMR = postResample(svmRPred, test_y),
      RandomForest = postResample(rfPred, test_y),
      BoostedTrees = postResample(gbmPred, test_y),
      Cubist = postResample(cubistPred, test_y))

##                              RMSE  Rsquared        MAE
## LinearRegression       0.13058186 0.4284191 0.10080289
## RobustLinearRegression 0.13069044 0.4285156 0.10058284
## NeuralNetwork          0.12523489 0.4770767 0.09195530
## MARS                   0.11711640 0.5395096 0.08767137
## SVMR                   0.11733152 0.5393575 0.08531385
## RandomForest           0.09381957 0.7177911 0.06765723
## BoostedTrees           0.10487097 0.6312280 0.07756929
## Cubist                 0.09757177 0.6811467 0.06732976

# kbl(models_summary) |>
#   kable_styling()

Linear Model (lm): The linear model shows moderate performance, with a RMSE of 0.1326 and an Rsquared of 0.4241. While it provides a simple and interpretable model, it may struggle to capture complex relationships in the data.

Robust Linear Model (rlm): The robust linear model exhibits similar performance to the linear model but with slightly lower RMSE (0.1324) and slightly higher Rsquared (0.4281). It offers improved robustness against outliers compared to the standard linear model.

Neural Network (nn): The neural network model demonstrates better predictive performance than linear models, with a lower RMSE of 0.1300 and a higher Rsquared of 0.4514. Neural networks can capture nonlinear relationships in the data, making them suitable for complex datasets.

Multivariate Adaptive Regression Splines (mars): MARS outperforms lm, rlm and nn with a lower RMSE of 0.1260 and the highest Rsquared of 0.4797. MARS is capable of capturing nonlinear and interaction effects in the data, making it a powerful tool for predictive modeling.

Support Vector Machine Regression (svmR): SVM regression shows competitive performance with a lower RMSE of 0.1197 and a relatively high Rsquared of 0.5323. SVMs excel in capturing complex patterns in high-dimensional spaces, offering robustness and generalization capabilities.

Random Forest: Random forest emerges as the top-performing model with the lowest RMSE of 0.0949 and the highest Rsquared of 0.7203. Random forests leverage ensemble learning to combine multiple decision trees, providing excellent predictive accuracy and resilience to overfitting.

Boosted Regression Trees (boosted): Boosted regression trees offer good predictive performance with a lower RMSE than linear models but slightly lower R-squared compared to random forest. Boosted models iteratively improve predictive accuracy by combining weak learners, making them effective for complex datasets.

Cubist: Cubist regression demonstrates performance similar to boosted regression trees, with a slightly lower RMSE and slightly higher R-squared. Cubist models use a combination of regression trees and linear models to capture complex relationships in the data.

In summary, while linear models offer simplicity and interpretability, more complex models like neural networks, MARS, SVM, random forest, boosted regression trees, and cubist provide superior predictive performance, especially for datasets with nonlinear relationships and interactions. The choice of the best model depends on factors such as the nature of the data, computational resources, interpretability requirements, and the desired balance between bias and variance. In our case, overall, random forest appears to be the best-performing model based on RMSE and R-squared metrics

Important Parameter According to Random Forest:

set.seed(123)
rf_model_importance <- varImp(rfTune)$importance |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  top_n(10) |>
  arrange(desc(Overall)) |>
  mutate(importance = row_number())
set.seed(123)
varImp(rfTune) %>%
  plot(., top = max(rf_model_importance$importance), main = "Top Ten Informative Predictors of Random Forest Model")

The provided graph displays the importance of various variables in predicting an outcome, ranked from highest to lowest importance. Here’s a discussion on each variable based on their importance scores:

Mnf_Flow: This variable stands out as the most important predictor, with an overall importance score of 100. It likely has a significant impact on the outcome being predicted, suggesting that variations in manufacturing flow have a strong association with the target variable.

Brand_Code: Brand code emerges as the second most important predictor, with a relatively high importance score of 40.43. This suggests that the brand of the product or equipment being analyzed plays a substantial role in influencing the outcome.

Usage_cont: With an importance score of 39.45, usage context ranks third in importance. This variable likely captures contextual factors related to how the product or equipment is used, indicating its relevance in predicting the outcome.

Oxygen_Filler: Oxygen filler ranks fourth in importance, with a score of 22.96. This variable likely represents the presence or characteristics of oxygen filling mechanisms, indicating its influence on the predicted outcome.

Carb_Rel: Carbonation relevance follows closely behind with an importance score of 21.90. It likely denotes the significance of carbonation-related factors in determining the outcome.

Temperature: Temperature emerges as the sixth most important predictor, with a score of 20.96. This suggests that variations in temperature may have a notable impact on the outcome being predicted.

Alch_Rel: Alcohol relevance ranks seventh in importance, with a score of 20.08. This variable likely pertains to the relevance or presence of alcohol-related factors, indicating their importance in the prediction process.

Pressure_Vacuum: Pressure vacuum ranks eighth in importance, with a score of 18.46. This variable likely represents pressure and vacuum-related characteristics, highlighting their role in predicting the outcome.

Balling_Lvl: With a score of 17.56, balling level ranks ninth in importance. This variable likely refers to the level or measurement of balling, indicating its influence on the predicted outcome.

Air_Pressurer: Air pressurer rounds out the top ten predictors with an importance score of 17.38. This variable likely denotes the relevance or characteristics of air pressure mechanisms, suggesting its impact on the predicted outcome.

Overall, these variables collectively provide insights into the factors that are most influential in predicting the outcome of interest. Understanding their relative importance can guide further analysis and decision-making processes in the relevant domain.

Predictions:

In this section we will use the evaluation data set to make prediction. We might carry out some kind of pre-processing in order to make the data set favorable or workable with our random forest model.

Loading And Preprocessing The Evaluation Dataset:

df_EvalData <- read.csv('https://raw.githubusercontent.com/Umerfarooq122/Using-Predictive-analytics-to-predict-PH-of-beverages/main/StudentEvaluation%20-%20Copy.csv',
                           na.strings = c("", NA))

Let’s check if we have any missing values in our evaluation data set

colSums(is.na(df_EvalData))

##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##                 8                 1                 6                 4 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                 0                 1                 5                 3 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                 5                 0                 4                 2 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                 0                 1                 1                 4 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                 2                10                 2                 2 
##         Carb.Flow           Density               MFR           Balling 
##                 0                 1                31                 1 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 1               267                 3                 1 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                 2                 1                 3                 2 
##       Balling.Lvl 
##                 0

We certainly do have some missing values so let’s use mice package just like we did it for training data but before that let’s fix the data type and column names

df_EvalData$Brand.Code <- as.factor(df_EvalData$Brand.Code)

colnames(df_EvalData) <- names

df_EvalData <- df_EvalData %>%
  select(-PH) %>%
  mice(., m = 1, method = 'rf', print = FALSE) %>% complete()

df_EvalData <- df_EvalData %>%
  select(-Hyd_Pressure1) %>%
  mutate(PH = "")

Prediction with Random Forest:

# predict PH
prediction <- predict(rfTune, df_EvalData)

head(prediction)

##        1        2        3        4        5        6 
## 8.580191 8.499329 8.532677 8.565613 8.472016 8.537093

df_EvalData$PH <- prediction

#average ph
df_EvalData %>%
  group_by(Brand_Code) %>%
  summarise(`Average PH` = mean(PH))

## # A tibble: 4 × 2
##   Brand_Code `Average PH`
##   <fct>             <dbl>
## 1 A                  8.49
## 2 B                  8.57
## 3 C                  8.42
## 4 D                  8.60

Writing The Predictions:

write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData), file = 'predictions_DJO.xlsx')

Conclusion:

In conclusion, the predictive modeling project aimed to enhance operational efficiency and product quality in manufacturing processes by developing a robust pH regulation model. Through meticulous data exploration, pre-processing, and model development, we embarked on a journey to uncover the predictive factors influencing pH variations.

Our analysis revealed the importance of leveraging advanced analytics and machine learning techniques to extract actionable insights from complex datasets. We explored various regression algorithms, including linear models, robust regression, neural networks, support vector machines, and ensemble methods like random forest and boosted regression trees. Each algorithm offered unique strengths and performance metrics, underscoring the importance of selecting models tailored to the specific characteristics of the data.

Furthermore, feature selection played a pivotal role in model development, with the random forest algorithm highlighting key predictors such as manufacturing flow, brand code, usage context, and temperature. Understanding the relative importance of these factors provided valuable insights into the underlying dynamics of pH regulation in manufacturing processes.

In evaluating model performance, metrics such as RMSE, R-squared, and mean absolute error provided a comprehensive view of predictive accuracy and generalization ability. While each model exhibited strengths and weaknesses, ensemble methods like random forest consistently outperformed other algorithms, offering superior predictive accuracy and resilience to overfitting.

The conclusion of our analysis underscores the importance of data-driven decision-making and the critical role of predictive modeling in driving innovation and efficiency in manufacturing operations. By leveraging the insights gained from this project, stakeholders can make informed decisions, optimize production processes, and ensure regulatory compliance.

Moving forward, further research and refinement of predictive models will continue to enhance our understanding of pH regulation in manufacturing processes, paving the way for continuous improvement and innovation in industrial settings. Through collaborative efforts between data scientists, domain experts, and operational stakeholders, we can harness the power of predictive analytics to unlock new opportunities and drive sustainable growth in the manufacturing industry.