Predicting the pH of Beverages
Introduction:
As a data scientist tasked with developing a predictive model for pH regulation in manufacturing processes, the overarching objective is to leverage data-driven insights to enhance operational efficiency, ensure regulatory compliance, and optimize product quality. By harnessing advanced analytics and machine learning techniques, we aim to empower decision-makers with actionable insights that enable proactive management of pH fluctuations throughout the production cycle.
Goal:
The goal of this project is to identify relevant features and engineer variables that capture the factors driving pH variation, using statistical methods and domain expertise to select the most informative features for model development. We will develop and train predictive models using appropriate algorithms such as regression, time series analysis, or other machine learning methods, and employ rigorous evaluation techniques, including cross-validation and standard performance metrics, to assess model accuracy and generalization ability. We also aim to ensure transparency and interpretability of the predictive model by employing feature importance analysis, model visualization, and explanation methods to elucidate the factors driving pH predictions. Finally, we will suggest how to integrate the predictive model with existing decision support or manufacturing control systems to enable automated decision-making based on pH forecasts, and to facilitate seamless communication between the data science team and operational stakeholders.
By pursuing these goals, we aim to establish a robust data science framework for pH predictive modeling that not only meets regulatory obligations but also drives innovation, efficiency, and excellence in manufacturing operations.
Data Acquisition and Exploration:
In this phase we collect and integrate diverse data sets covering raw materials, process parameters, environmental conditions, and historical pH measurements to build a comprehensive understanding of the manufacturing process. In our case a data set was provided, but it was not ready to be fed directly to an algorithm to train a model. The data set came in .xlsx format; for reproducibility we converted it to comma-separated values (.csv) and stored it in a remote location, a GitHub repository. The code chunk below loads the data set into our environment so that we can preprocess it and make it ready for the algorithms.
Loading The Dataset:
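The loading chunk itself does not appear in the rendered output. Below is a minimal sketch, assuming the tidyverse stack used throughout and a training file stored in the same GitHub repository as the evaluation file referenced later; the file name StudentData.csv is an assumption.
library(tidyverse)   # readr, dplyr, tidyr, purrr, ggplot2, tibble
library(corrplot)    # correlation plot used later
library(mice)        # imputation used later
library(caret)       # model training and evaluation used later

# Training file name below is an assumption; only the repository and the
# evaluation file name are confirmed elsewhere in this report.
df <- read_csv(paste0("https://raw.githubusercontent.com/Umerfarooq122/",
                      "Using-Predictive-analytics-to-predict-PH-of-beverages/",
                      "main/StudentData.csv"),
               na = c("", "NA"))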
Data Exploration:
Let’s check out the first few rows of the data set to confirm that all the features and observations are loaded correctly.
Brand Code | Carb Volume | Fill Ounces | PC Volume | Carb Pressure | Carb Temp | PSC | PSC Fill | PSC CO2 | Mnf Flow | Carb Pressure1 | Fill Pressure | Hyd Pressure1 | Hyd Pressure2 | Hyd Pressure3 | Hyd Pressure4 | Filler Level | Filler Speed | Temperature | Usage cont | Carb Flow | Density | MFR | Balling | Pressure Vacuum | PH | Oxygen Filler | Bowl Setpoint | Pressure Setpoint | Air Pressurer | Alch Rel | Carb Rel | Balling Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B | 5.340000 | 23.96667 | 0.2633333 | 68.2 | 141.2 | 0.104 | 0.26 | 0.04 | -100 | 118.8 | 46.0 | 0 | NA | NA | 118 | 121.2 | 4,002 | 66.0 | 16.18 | 2,932 | 0.88 | 725.0 | 1.398 | -4.0 | 8.36 | 0.022 | 120 | 46.4 | 142.6 | 6.58 | 5.32 | 1.48
A | 5.426667 | 24.00667 | 0.2386667 | 68.4 | 139.6 | 0.124 | 0.22 | 0.04 | -100 | 121.6 | 46.0 | 0 | NA | NA | 106 | 118.6 | 3,986 | 67.6 | 19.90 | 3,144 | 0.92 | 726.8 | 1.498 | -4.0 | 8.26 | 0.026 | 120 | 46.8 | 143.0 | 6.56 | 5.30 | 1.56
B | 5.286667 | 24.06000 | 0.2633333 | 70.8 | 144.8 | 0.090 | 0.34 | 0.16 | -100 | 120.2 | 46.0 | 0 | NA | NA | 82 | 120.0 | 4,020 | 67.0 | 17.76 | 2,914 | 1.58 | 735.0 | 3.142 | -3.8 | 8.94 | 0.024 | 120 | 46.6 | 142.0 | 7.66 | 5.84 | 3.28
A | 5.440000 | 24.00667 | 0.2933333 | 63.0 | 132.6 | NA | 0.42 | 0.04 | -100 | 115.2 | 46.4 | 0 | 0 | 0 | 92 | 117.8 | 4,012 | 65.6 | 17.42 | 3,062 | 1.54 | 730.6 | 3.042 | -4.4 | 8.24 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.42 | 3.04
A | 5.486667 | 24.31333 | 0.1113333 | 67.2 | 136.8 | 0.026 | 0.16 | 0.12 | -100 | 118.4 | 45.8 | 0 | 0 | 0 | 92 | 118.6 | 4,010 | 65.6 | 17.68 | 3,054 | 1.54 | 722.8 | 3.042 | -4.4 | 8.26 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.04 |
A | 5.380000 | 23.92667 | 0.2693333 | 66.6 | 138.4 | 0.090 | 0.24 | 0.04 | -100 | 119.6 | 45.6 | 0 | 0 | 0 | 116 | 120.2 | 4,014 | 66.2 | 23.82 | 2,948 | 1.52 | 738.8 | 2.992 | -4.4 | 8.32 | 0.024 | 120 | 46.0 | 146.6 | 7.16 | 5.44 | 3.02 |
Everything appears to have loaded correctly. Let's check the structure of the data set to confirm the data types of the features.
## Rows: 2,571
## Columns: 33
## $ `Brand Code` <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", …
## $ `Carb Volume` <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, …
## $ `Fill Ounces` <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, …
## $ `PC Volume` <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1113…
## $ `Carb Pressure` <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64…
## $ `Carb Temp` <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 1…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.15…
## $ `PSC Fill` <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.…
## $ `PSC CO2` <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.…
## $ `Mnf Flow` <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -1…
## $ `Carb Pressure1` <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 1…
## $ `Fill Pressure` <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46…
## $ `Hyd Pressure1` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure2` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure3` <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Hyd Pressure4` <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, …
## $ `Filler Level` <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 1…
## $ `Filler Speed` <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014…
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65…
## $ `Usage cont` <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 1…
## $ `Carb Flow` <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902,…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, …
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1…
## $ `Pressure Vacuum` <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.…
## $ `Oxygen Filler` <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0…
## $ `Bowl Setpoint` <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, …
## $ `Pressure Setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46…
## $ `Air Pressurer` <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 1…
## $ `Alch Rel` <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.…
## $ `Carb Rel` <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.…
## $ `Balling Lvl` <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.…
We can see that most of the features have the right data type, apart from Brand Code, which is read in as character and needs to be fixed. Everything else looks good as far as data types go, but we can already observe missing values in several variables, such as MFR, Brand Code, and Filler Speed.
Descriptive Summary:
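The call that produces the output below is not shown in the rendered report; it is presumably base R's summary() applied to the data frame.
summary(df)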
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
Visual Analytics of Features:
Since we have a lot of numeric predictors, let's plot histograms to see how they are distributed.
df %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = "free") +
ggtitle("Histograms of Numerical Predictors")
We can see that predictors like Carb Pressure, Carb Temp, Carb Volume, Pressure Vacuum, PC Volume, and Fill Ounces are approximately normally distributed, and so is our target variable PH. Since our aim is to predict PH, which is a continuous quantity, we are dealing with a regression problem, which means linear regression is a natural starting point. Linear regression may give reasonable results, but many of the predictors are clearly skewed or multi-modal; they could be transformed to improve the fit, although we will not spend much time transforming each predictor individually if the linear regression model turns out to need substantial improvement. At that point we would move to other regression techniques such as multivariate adaptive regression splines (MARS), support vector machines (SVM), boosted trees, random forests, and neural networks.
Let’s check out the Brand Code column now but we need to
fix the data type before we plot the column.
df$`Brand Code` <- as.factor(df$`Brand Code`)
df %>%
ggplot() +
geom_bar(aes(x = `Brand Code`)) +
ggtitle("Distribution of the Brand Codes")
We can see that there are four different brand codes, i.e. A, B, C, and D, and that a large share of the observations come from brand code B. We can also see that there are a number of missing values in the brand codes. In the next section we will take care of the missing values in our data set.
Pre-processing Dataset:
In the previous section we saw that certain columns had a lot of missing values; in this section we will use different techniques to impute them. Before imputation, let's check the number of missing values in each column.
Checking Missing Values:
df %>%
summarise_all(list(~ sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0) %>%
arrange(-value)
## # A tibble: 31 × 2
## variable value
## <chr> <int>
## 1 MFR 212
## 2 Brand Code 120
## 3 Filler Speed 57
## 4 PC Volume 39
## 5 PSC CO2 39
## 6 Fill Ounces 38
## 7 PSC 33
## 8 Carb Pressure1 32
## 9 Hyd Pressure4 30
## 10 Carb Pressure 27
## # ℹ 21 more rows
We can see that the MFR column has the highest number of missing values, about 8.24% of the total observations. A common rule of thumb among data scientists is to consider dropping columns that are missing in more than 5% of the observations, but here we will keep MFR and impute it instead. Brand Code also has missing values, around 4.7% of the observations. To impute these missing values we will use the mice package in R, but before that we have to fix the column names and remove the spaces in the names, since they are not compatible with the mice package functions.
Fixing Column Names:
In this section we will create a list to hold the column names and then use that list to replace all the column names in the data frame.
names <- c("Brand_Code", "Carb_Volume", "Fill_Ounces", "PC_Volume", "Carb_Pressure", "Carb_Temp", "PSC", "PSC_Fill", "PSC_CO2", "Mnf_Flow", "Carb_Pressure1", "Fill_Pressure", "Hyd_Pressure1", "Hyd_Pressure2", "Hyd_Pressure3", "Hyd_Pressure4", "Filler_Level", "Filler_Speed", "Temperature", "Usage_cont", "Carb_Flow", "Density", "MFR", "Balling", "Pressure_Vacuum", "PH", "Oxygen_Filler", "Bowl_Setpoint",
"Pressure_Setpoint", "Air_Pressurer", "Alch_Rel", "Carb_Rel", "Balling_Lvl")
Now we replace the column names and check the first few rows of the data frame.
Brand_Code | Carb_Volume | Fill_Ounces | PC_Volume | Carb_Pressure | Carb_Temp | PSC | PSC_Fill | PSC_CO2 | Mnf_Flow | Carb_Pressure1 | Fill_Pressure | Hyd_Pressure1 | Hyd_Pressure2 | Hyd_Pressure3 | Hyd_Pressure4 | Filler_Level | Filler_Speed | Temperature | Usage_cont | Carb_Flow | Density | MFR | Balling | Pressure_Vacuum | PH | Oxygen_Filler | Bowl_Setpoint | Pressure_Setpoint | Air_Pressurer | Alch_Rel | Carb_Rel | Balling_Lvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B | 5.340000 | 23.96667 | 0.2633333 | 68.2 | 141.2 | 0.104 | 0.26 | 0.04 | -100 | 118.8 | 46.0 | 0 | NA | NA | 118 | 121.2 | 4,002 | 66.0 | 16.18 | 2,932 | 0.88 | 725.0 | 1.398 | -4.0 | 8.36 | 0.022 | 120 | 46.4 | 142.6 | 6.58 | 5.32 | 1.48
A | 5.426667 | 24.00667 | 0.2386667 | 68.4 | 139.6 | 0.124 | 0.22 | 0.04 | -100 | 121.6 | 46.0 | 0 | NA | NA | 106 | 118.6 | 3,986 | 67.6 | 19.90 | 3,144 | 0.92 | 726.8 | 1.498 | -4.0 | 8.26 | 0.026 | 120 | 46.8 | 143.0 | 6.56 | 5.30 | 1.56
B | 5.286667 | 24.06000 | 0.2633333 | 70.8 | 144.8 | 0.090 | 0.34 | 0.16 | -100 | 120.2 | 46.0 | 0 | NA | NA | 82 | 120.0 | 4,020 | 67.0 | 17.76 | 2,914 | 1.58 | 735.0 | 3.142 | -3.8 | 8.94 | 0.024 | 120 | 46.6 | 142.0 | 7.66 | 5.84 | 3.28
A | 5.440000 | 24.00667 | 0.2933333 | 63.0 | 132.6 | NA | 0.42 | 0.04 | -100 | 115.2 | 46.4 | 0 | 0 | 0 | 92 | 117.8 | 4,012 | 65.6 | 17.42 | 3,062 | 1.54 | 730.6 | 3.042 | -4.4 | 8.24 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.42 | 3.04
A | 5.486667 | 24.31333 | 0.1113333 | 67.2 | 136.8 | 0.026 | 0.16 | 0.12 | -100 | 118.4 | 45.8 | 0 | 0 | 0 | 92 | 118.6 | 4,010 | 65.6 | 17.68 | 3,054 | 1.54 | 722.8 | 3.042 | -4.4 | 8.26 | 0.030 | 120 | 46.0 | 146.2 | 7.14 | 5.44 | 3.04 |
A | 5.380000 | 23.92667 | 0.2693333 | 66.6 | 138.4 | 0.090 | 0.24 | 0.04 | -100 | 119.6 | 45.6 | 0 | 0 | 0 | 116 | 120.2 | 4,014 | 66.2 | 23.82 | 2,948 | 1.52 | 738.8 | 2.992 | -4.4 | 8.32 | 0.024 | 120 | 46.0 | 146.6 | 7.16 | 5.44 | 3.02 |
We can see that the column names have been updated.
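For completeness, the renaming call that produced the table above is not shown in the rendered output; a minimal sketch is simply:
colnames(df) <- names
head(df)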
Imputing Missing Values:
Now that our data set is ready for the imputation, we can go ahead and use the mice package with random forest (RF) imputation. RF is particularly advantageous due to its capability to handle complex relationships, including non-linearities and interactions among variables. This robustness makes it suitable for data sets with mixed data types, as RF can manage both continuous and categorical variables without extensive pre-processing. By employing an ensemble of decision trees, RF reduces bias compared to simpler imputation methods like mean or median imputation, yielding more plausible estimates for missing values. Moreover, RF provides insights into variable importance, aiding variable selection and interpretability. Because mice is a multiple-imputation framework, the RF imputations also reflect uncertainty about the missing values rather than treating a single filled-in value as exact. Overall, RF imputation in mice offers a versatile and effective approach to handling missing data, contributing to more robust statistical analyses and modeling.
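The imputation call is not shown in the rendered report. A minimal sketch, assuming a single completed data set is kept (the number of imputations m and the seed are assumptions):
set.seed(123)
imputed <- mice(df, method = "rf", m = 5, printFlag = FALSE)  # random-forest imputation
df <- complete(imputed)                                        # keep one completed data set
sum(is.na(df))                                                 # should now be 0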
After taking care of the missing values we will remove the columns that have low to almost zero variance, since they are unlikely to help with the predictions. Removing zero-variance columns from a dataset before modeling serves several important purposes. First and foremost, these columns, which contain no variability because all values are the same, offer no discriminatory power to the model, essentially conveying no useful information for classification or prediction tasks. By eliminating them, the efficiency of the modeling process is enhanced, as computational resources are not wasted on training with non-informative features. Additionally, excluding zero-variance columns helps avoid potential issues with certain algorithms that may encounter numerical instabilities or errors when confronted with such features. Furthermore, the model's interpretation becomes simpler when irrelevant features are removed, allowing for a clearer understanding of the relationship between predictors and outcomes. This process also helps reduce redundancy and multicollinearity within the dataset, thereby improving the stability and performance of the model.
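A sketch of the variance filter, assuming caret's nearZeroVar() with its default thresholds (the report only states the outcome, so the exact settings are an assumption):
nzv <- nearZeroVar(df)            # indices of near-zero-variance columns
colnames(df)[nzv]                 # in this run, Hyd_Pressure1
if (length(nzv) > 0) df <- df[, -nzv]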
In this case, only Hyd_Pressure1 was removed because of its low variance. Now that our data set is ready, we can go ahead and check the correlation among the different columns and with our target column PH.
Feature Selection:
Correlation Among Features:
Since this is a regression problem, we need to check the collinearity among the variables. We may use linear regression, and for linear regression collinearity and multicollinearity can strongly affect the performance of the model. We can also drop weak predictors, or predictors that are highly correlated with other predictors.
# Option 1: Removing non-numeric columns before calculating correlation
numeric_df <- df %>%
select_if(is.numeric)
# Calculate correlation matrix and plot
correlation_matrix <- cor(numeric_df)
corrplot(correlation_matrix, title = "Correlation between Variables")
We can see that PH does not have a strong positive correlation with many of the features. It has a moderate correlation with features like Filler_Level, Pressure_Vacuum, Oxygen_Filler, and Bowl_Setpoint. We can also observe multicollinearity among Density, Balling, Alch_Rel, Hyd_Pressure3, and Balling_Lvl, among others, which can cause problems if we use linear regression, since multicollinearity inflates the variance of the coefficient estimates. We could counter that by carrying out principal component analysis (PCA). PCA would reduce the dimensionality and remove the multicollinearity, but even then we would need predictors that correlate strongly with the target for linear regression to do well. If we instead start dropping columns, there is a higher chance of ending up with a very low \(R^2\), meaning the variance in the data would not be captured by the linear regression. So we will try not to remove any columns, and alongside linear regression we will try other models such as trees and rule-based models, support vector machines, and neural networks, which are not limited by multicollinearity. We may not get good performance from a neural network either, since neural networks typically require a lot of data points, and for this particular study we only have 2,571 observations, which will be further split into training and testing data sets.
Model Development And Evaluation:
In this section we will develop our models. First we will split the data 80/20: 80% for training the algorithms and 20% for testing their performance. Once the split is done we can go ahead and train our models. For this study we will use linear regression, robust linear regression, multivariate adaptive regression splines (MARS), support vector machines, random forests, Cubist, boosted trees, and neural networks.
Splitting The Data:
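The split itself is not shown in the rendered output. A minimal sketch, assuming caret's createDataPartition() and the train_x/train_y/test_x/test_y objects used by the model chunks below:
set.seed(123)
train_idx <- createDataPartition(df$PH, p = 0.8, list = FALSE)  # stratified on the target

train_x <- df[train_idx, ] %>% select(-PH)
train_y <- df$PH[train_idx]
test_x  <- df[-train_idx, ] %>% select(-PH)
test_y  <- df$PH[-train_idx]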
Linear Regression
ctrl <- trainControl(method = "cv", number = 10)
suppressMessages({
set.seed(123)
lmtuned <- train(train_x,train_y, method = 'lm', preProcess = c("center","scale"), trControl = ctrl)
lmpred <- predict(lmtuned, test_x)
})
postResample(lmpred, test_y)
## RMSE Rsquared MAE
## 0.1305819 0.4284191 0.1008029
Robust Linear Regression
set.seed(123)
rlmPCA <- train(train_x, train_y, method = "rlm", preProcess = c("center","scale"), trControl = ctrl)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
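The prediction step for the robust linear model is not shown in the rendered output; a minimal sketch consistent with the rlmpred object used in the model-comparison table later on:
rlmpred <- predict(rlmPCA, test_x)
postResample(rlmpred, test_y)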
## RMSE Rsquared MAE
## 0.1306904 0.4285156 0.1005828
Boosted Trees
set.seed(123)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
preProcess = c("center", "scale"),
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 0.10487097 0.63122799 0.07756929
MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(train_x,train_y,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"))
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y)
## RMSE Rsquared MAE
## 0.11711640 0.53950964 0.08767137
Cubist
set.seed(123)
cubistTuned <- train(train_x, train_y,
preProcess = c("center", "scale"),
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.09757177 0.68114666 0.06732976
Support Vector Machines
set.seed(123)
# tune
svmRTune <- train(train_x[, -1], train_y,
method = "svmRadial",
tuneLength = 14,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"))
svmRPred <- predict(svmRTune, test_x[, -1])
postResample(svmRPred, test_y)
## RMSE Rsquared MAE
## 0.11733152 0.53935746 0.08531385
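Neural Network
The neural network and random forest chunks are not shown in the rendered output, even though nnPred, rfPred, and the tuned random forest object rfTune are used in the comparison below. The sketches here show one plausible way to fit them with caret; the averaged neural network (avNNet), the tuning grid, and tuneLength are assumptions rather than the exact settings used.
set.seed(123)
nnetGrid <- expand.grid(.decay = c(0.01, 0.1), .size = c(3, 5, 7), .bag = FALSE)
nnTune <- train(train_x[, -1], train_y,          # drop the Brand_Code factor, as in the SVM chunk
                method = "avNNet",
                tuneGrid = nnetGrid,
                preProcess = c("center", "scale"),
                linout = TRUE, trace = FALSE,
                trControl = trainControl(method = "cv"))
nnPred <- predict(nnTune, test_x[, -1])
postResample(nnPred, test_y)
Random Forest
set.seed(123)
rfTune <- train(train_x, train_y,
                method = "rf",
                importance = TRUE,               # permutation importance for varImp() later
                tuneLength = 5,
                trControl = trainControl(method = "cv"))
rfPred <- predict(rfTune, test_x)
postResample(rfPred, test_y)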
Evaluation:
Model Selection:
In regression analysis, evaluating the performance of models is crucial for understanding how well they capture the relationship between predictor variables and the target variable. Commonly used metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) provide insights into the accuracy of predictions, while R-squared (R²) measures the proportion of variance explained by the model. Residual analysis and cross-validation techniques further aid in assessing model assumptions and generalization to new data. By considering these evaluation metrics and techniques, analysts can make informed decisions about model selection, refinement, and interpretation, ultimately ensuring the reliability and validity of regression models in capturing underlying patterns in the data.
rbind(LinearRegression = postResample(lmpred, test_y),
RobustLinearRegression = postResample(rlmpred, test_y),
NeuralNetwork = postResample(nnPred, test_y),
MARS = postResample(marsPred, test_y),
SVMR = postResample(svmRPred, test_y),
RandomForest = postResample(rfPred, test_y),
BoostedTrees = postResample(gbmPred, test_y),
Cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## LinearRegression 0.13058186 0.4284191 0.10080289
## RobustLinearRegression 0.13069044 0.4285156 0.10058284
## NeuralNetwork 0.12523489 0.4770767 0.09195530
## MARS 0.11711640 0.5395096 0.08767137
## SVMR 0.11733152 0.5393575 0.08531385
## RandomForest 0.09381957 0.7177911 0.06765723
## BoostedTrees 0.10487097 0.6312280 0.07756929
## Cubist 0.09757177 0.6811467 0.06732976
Linear Model (lm): The linear model shows moderate performance, with an RMSE of 0.1306 and an R-squared of 0.4284. While it provides a simple and interpretable model, it may struggle to capture complex relationships in the data.
Robust Linear Model (rlm): The robust linear model performs essentially the same as the linear model, with an RMSE of 0.1307, an R-squared of 0.4285, and a marginally lower MAE. It offers improved robustness against outliers compared to the standard linear model, although that makes little difference here.
Neural Network (nn): The neural network model demonstrates better predictive performance than the linear models, with a lower RMSE of 0.1252 and a higher R-squared of 0.4771. Neural networks can capture nonlinear relationships in the data, making them suitable for complex datasets.
Multivariate Adaptive Regression Splines (mars): MARS outperforms lm, rlm, and nn with an RMSE of 0.1171 and an R-squared of 0.5395. MARS is capable of capturing nonlinear and interaction effects in the data, making it a powerful tool for predictive modeling.
Support Vector Machine Regression (svmR): SVM regression is essentially on par with MARS, with an RMSE of 0.1173 and an R-squared of 0.5394. SVMs excel in capturing complex patterns in high-dimensional spaces, offering robustness and generalization capabilities.
Random Forest: Random forest emerges as the top-performing model, with the lowest RMSE of 0.0938 and the highest R-squared of 0.7178. Random forests leverage ensemble learning to combine many decision trees, providing excellent predictive accuracy and resilience to overfitting.
Boosted Regression Trees (boosted): Boosted regression trees offer good predictive performance, with an RMSE of 0.1049 and an R-squared of 0.6312, well below the linear models' error but behind random forest. Boosted models iteratively improve predictive accuracy by combining weak learners, making them effective for complex datasets.
Cubist: Cubist comes in second overall, with an RMSE of 0.0976 and an R-squared of 0.6811, slightly better than boosted regression trees. Cubist models use a combination of regression trees and linear models to capture complex relationships in the data.
In summary, while linear models offer simplicity and interpretability, more complex models like neural networks, MARS, SVM, random forest, boosted regression trees, and Cubist provide superior predictive performance, especially for datasets with nonlinear relationships and interactions. The choice of the best model depends on factors such as the nature of the data, computational resources, interpretability requirements, and the desired balance between bias and variance. In our case, random forest is the best-performing model based on the RMSE and R-squared metrics.
Important Predictors According to the Random Forest Model:
set.seed(123)
rf_model_importance <- varImp(rfTune)$importance |>
as.data.frame() |>
rownames_to_column("Variable") |>
top_n(10) |>
arrange(desc(Overall)) |>
mutate(importance = row_number())
set.seed(123)
varImp(rfTune) %>%
plot(., top = max(rf_model_importance$importance), main = "Top Ten Informative Predictors of Random Forest Model")
The provided graph displays the importance of various variables in predicting the outcome, ranked from highest to lowest importance. Here's a discussion of each variable based on its importance score:
Mnf_Flow: This variable stands out as the most important predictor, with an overall importance score of 100. It likely has a significant impact on the outcome being predicted, suggesting that variations in manufacturing flow have a strong association with the target variable.
Brand_Code: Brand code emerges as the second most important predictor, with a relatively high importance score of 40.43. This suggests that the brand of the product or equipment being analyzed plays a substantial role in influencing the outcome.
Usage_cont: With an importance score of 39.45, usage context ranks third in importance. This variable likely captures contextual factors related to how the product or equipment is used, indicating its relevance in predicting the outcome.
Oxygen_Filler: Oxygen filler ranks fourth in importance, with a score of 22.96. This variable likely represents the presence or characteristics of oxygen filling mechanisms, indicating its influence on the predicted outcome.
Carb_Rel: Carbonation relevance follows closely behind with an importance score of 21.90. It likely denotes the significance of carbonation-related factors in determining the outcome.
Temperature: Temperature emerges as the sixth most important predictor, with a score of 20.96. This suggests that variations in temperature may have a notable impact on the outcome being predicted.
Alch_Rel: Alcohol relevance ranks seventh in importance, with a score of 20.08. This variable likely pertains to the relevance or presence of alcohol-related factors, indicating their importance in the prediction process.
Pressure_Vacuum: Pressure vacuum ranks eighth in importance, with a score of 18.46. This variable likely represents pressure and vacuum-related characteristics, highlighting their role in predicting the outcome.
Balling_Lvl: With a score of 17.56, balling level ranks ninth in importance. This variable likely refers to the level or measurement of balling, indicating its influence on the predicted outcome.
Air_Pressurer: Air pressurer rounds out the top ten predictors with an importance score of 17.38. This variable likely denotes the relevance or characteristics of air pressure mechanisms, suggesting its impact on the predicted outcome.
Overall, these variables collectively provide insights into the factors that are most influential in predicting the outcome of interest. Understanding their relative importance can guide further analysis and decision-making processes in the relevant domain.
Predictions:
In this section we will use the evaluation data set to make predictions. We will carry out some pre-processing in order to make this data set compatible with our random forest model.
Loading And Preprocessing The Evaluation Dataset:
df_EvalData <- read.csv('https://raw.githubusercontent.com/Umerfarooq122/Using-Predictive-analytics-to-predict-PH-of-beverages/main/StudentEvaluation%20-%20Copy.csv',
na.strings = c("", NA))
Let's check if we have any missing values in our evaluation data set.
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 8 1 6 4
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 0 1 5 3
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 5 0 4 2
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## 0 1 1 4
## Filler.Level Filler.Speed Temperature Usage.cont
## 2 10 2 2
## Carb.Flow Density MFR Balling
## 0 1 31 1
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 1 267 3 1
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 2 1 3 2
## Balling.Lvl
## 0
We certainly do have some missing values, so let's use the mice package just as we did for the training data, but before that let's fix the data types and column names.
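The pre-processing chunk is not shown in the rendered output. A minimal sketch, assuming the same column-name vector, factor conversion, and random-forest imputation as before; PH is dropped before imputation because it is the quantity we want to predict:
colnames(df_EvalData) <- names                               # reuse the cleaned column names
df_EvalData$Brand_Code <- as.factor(df_EvalData$Brand_Code)

df_EvalData <- df_EvalData %>% select(-PH)                   # PH is entirely missing in the evaluation set
set.seed(123)
df_EvalData <- complete(mice(df_EvalData, method = "rf", printFlag = FALSE))
eval_x <- df_EvalData %>% select(-Hyd_Pressure1)             # drop the low-variance column removed earlier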
Prediction with Random Forest:
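The prediction call itself is also not shown; a minimal sketch using the tuned random forest from the modeling section:
prediction <- predict(rfTune, newdata = eval_x)
head(prediction)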
## 1 2 3 4 5 6
## 8.580191 8.499329 8.532677 8.565613 8.472016 8.537093
df_EvalData$PH <- prediction
#average ph
df_EvalData %>%
group_by(Brand_Code) %>%
summarise(`Average PH` = mean(PH))
## # A tibble: 4 × 2
## Brand_Code `Average PH`
## <fct> <dbl>
## 1 A 8.49
## 2 B 8.57
## 3 C 8.42
## 4 D 8.60
Conclusion:
In conclusion, the predictive modeling project aimed to enhance operational efficiency and product quality in manufacturing processes by developing a robust pH regulation model. Through meticulous data exploration, pre-processing, and model development, we embarked on a journey to uncover the predictive factors influencing pH variations.
Our analysis revealed the importance of leveraging advanced analytics and machine learning techniques to extract actionable insights from complex datasets. We explored various regression algorithms, including linear models, robust regression, neural networks, support vector machines, and ensemble methods like random forest and boosted regression trees. Each algorithm offered unique strengths and performance metrics, underscoring the importance of selecting models tailored to the specific characteristics of the data.
Furthermore, feature selection played a pivotal role in model development, with the random forest algorithm highlighting key predictors such as manufacturing flow, brand code, usage context, and temperature. Understanding the relative importance of these factors provided valuable insights into the underlying dynamics of pH regulation in manufacturing processes.
In evaluating model performance, metrics such as RMSE, R-squared, and mean absolute error provided a comprehensive view of predictive accuracy and generalization ability. While each model exhibited strengths and weaknesses, ensemble methods like random forest consistently outperformed other algorithms, offering superior predictive accuracy and resilience to overfitting.
The conclusion of our analysis underscores the importance of data-driven decision-making and the critical role of predictive modeling in driving innovation and efficiency in manufacturing operations. By leveraging the insights gained from this project, stakeholders can make informed decisions, optimize production processes, and ensure regulatory compliance.
Moving forward, further research and refinement of predictive models will continue to enhance our understanding of pH regulation in manufacturing processes, paving the way for continuous improvement and innovation in industrial settings. Through collaborative efforts between data scientists, domain experts, and operational stakeholders, we can harness the power of predictive analytics to unlock new opportunities and drive sustainable growth in the manufacturing industry.