CUNY School of Professional Studies

Data 624 - Final Project

Team NaJoDu

Members:

Description

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Enabling Libraries

In this section we enable the required Jupyter extensions to keep notebook functionality, and load the R libraries required for the analysis.

# Library PACKAGES for R
library(parallel)    # For parallel computing
library(MASS)        # Backwards stepwise OLS
library(dplyr)       # For data manipulation
library(tidyr)       # For data manipulation
library(reshape2)    # For data manipulation
library(funModeling) # For exploratory plots (plot_num)
library(naniar)      # For missing-data visualizations
library(visdat)      # For missing-data and correlation visualizations (vis_miss, vis_cor)
library(corrplot)    # For correlation plots
library(lattice)     # Required by caret
library(caret)       # For pre-processing, findCorrelation, nearZeroVar
library(psych)       # For descriptive statistics
library(ggplot2)     # For visualizations
library(prettyR)     # For descriptive factor summaries
library(ggbiplot)    # For Principal Component Analysis biplots
library(nnet)        # For neural networks / multinomial models
library(randomForest)# Random forest
library(xgboost)     # Extreme gradient boosting
library(fastDummies) # One-hot encoding

Solution

Two data sets have been provided; their descriptions are as follows:

  • StudentData.xlsx — the historical (training) data set
  • StudentEvaluation.xlsx — the evaluation data set to be scored

For simplicity, we have placed these data sets in a folder named “/data”.

Data review

Let’s see the composition of the provided data sets.

Load Data sets

Let’s read the contents of the given data sets.

# Loading data-sets.

# Training data set.
train.df <- readxl::read_excel('~/Project2/data624-poject2/data/StudentData.xlsx')
# Evaluation data set.
eval.df <- readxl::read_excel('~/Project2/data624-poject2/data/StudentEvaluation.xlsx')

Data Inspection

Let’s find some insights from our given data sets.

Dimensions

Let’s find out the composition of the given data sets in terms of predictors and observations.

# Finding data dimensions.
dims <- data.frame("Train" =  dim(train.df),
                   "Eval" = dim(eval.df))
rownames(dims) <- c("Observations","Predictors")
dims
##              Train Eval
## Observations  2571  267
## Predictors      33   33

From the above table, we can observe that the training set contains a total of 2571 observations with 33 variables, including the response PH. Our given evaluation set is composed of 267 observations with the same 33 variables.

Composition

Let’s review the composition of this data set, starting with the predictor names and other relevant information.

# Predictor's naming.
colnames(train.df)
##  [1] "Brand Code"        "Carb Volume"       "Fill Ounces"      
##  [4] "PC Volume"         "Carb Pressure"     "Carb Temp"        
##  [7] "PSC"               "PSC Fill"          "PSC CO2"          
## [10] "Mnf Flow"          "Carb Pressure1"    "Fill Pressure"    
## [13] "Hyd Pressure1"     "Hyd Pressure2"     "Hyd Pressure3"    
## [16] "Hyd Pressure4"     "Filler Level"      "Filler Speed"     
## [19] "Temperature"       "Usage cont"        "Carb Flow"        
## [22] "Density"           "MFR"               "Balling"          
## [25] "Pressure Vacuum"   "PH"                "Oxygen Filler"    
## [28] "Bowl Setpoint"     "Pressure Setpoint" "Air Pressurer"    
## [31] "Alch Rel"          "Carb Rel"          "Balling Lvl"

From the above, we can quickly review the predictor names. The idea is to develop a better understanding of the naming conventions and their meaning.

For example, it is noted that some of the predictors do not have a descriptive name, such as PSC, MFR and PH.

pH

A measure of acidity or alkalinity of water-soluble substances (pH stands for ‘potential of hydrogen’).

A pH value is a number from 1 to 14, with 7 as the middle (neutral) point.

  • Values below 7 indicate acidity which increases as the number decreases, 1 being the most acidic.

  • Values above 7 indicate alkalinity which increases as the number increases, 14 being the most alkaline.

This scale, however, is not a linear scale like a centimeter or inch scale (in which two adjacent values differ by the same amount). It is a logarithmic scale, in which two adjacent values differ by a factor of 10.
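
As a quick numeric illustration of the factor-of-10 behavior (using the standard relation that hydrogen-ion concentration equals \(10^{-pH}\) mol/L):

# Hydrogen-ion concentration is 10^(-pH), so one pH unit
# corresponds to a tenfold change in concentration.
ph.values <- c(7, 8, 9)
h.conc <- 10^(-ph.values)    # concentrations in mol/L
h.conc[1] / h.conc[2]        # pH 7 is ten times more acidic than pH 8
## [1] 10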

Let’s have a visual representation of the pH values

pH Chart

Since the problem statement does not specify what kind of beverage company ABC Beverage is, we will try to determine the product category based on the pH values given in the data set.

# pH Summary.
pH <- train.df[,'PH']
summary(pH)
##        PH       
##  Min.   :7.880  
##  1st Qu.:8.440  
##  Median :8.540  
##  Mean   :8.546  
##  3rd Qu.:8.680  
##  Max.   :9.360  
##  NA's   :4
# pH Histogram.
hist(pH$PH,
    main = 'pH Frequency',
    xlab = 'pH',
    col = 3)

From the above results, we can infer that ABC Beverage appears to produce mixed fruit and vegetable juice blends.

Beverage

Now that we have established the kind of beverage, we will focus on the rest of the predictors.

Summary Statistics by Predictor

Let’s take a look at the following summary table in order to have a better understanding of the data.

# Descriptive Statistics.
train.df.describe <- describeBy(train.df)

# Adding NA Column.
train.df.describe$na <- dims['Observations','Train'] - train.df.describe$n

# Creating a summary data frame.
describe.cols <- c('vars', 'n', 'na', 'mean', 'sd', 'median', 'min', 'max', 'skew')
dplyr::select(train.df.describe, dplyr::all_of(describe.cols))
##                   vars    n  na    mean      sd  median     min     max
## Brand Code*          1 2451 120     NaN      NA      NA     Inf    -Inf
## Carb Volume          2 2561  10    5.37    0.11    5.35    5.04    5.70
## Fill Ounces          3 2533  38   23.97    0.09   23.97   23.63   24.32
## PC Volume            4 2532  39    0.28    0.06    0.27    0.08    0.48
## Carb Pressure        5 2544  27   68.19    3.54   68.20   57.00   79.40
## Carb Temp            6 2545  26  141.09    4.04  140.80  128.60  154.00
## PSC                  7 2538  33    0.08    0.05    0.08    0.00    0.27
## PSC Fill             8 2548  23    0.20    0.12    0.18    0.00    0.62
## PSC CO2              9 2532  39    0.06    0.04    0.04    0.00    0.24
## Mnf Flow            10 2569   2   24.57  119.48   65.20 -100.20  229.40
## Carb Pressure1      11 2539  32  122.59    4.74  123.20  105.60  140.20
## Fill Pressure       12 2549  22   47.92    3.18   46.40   34.60   60.40
## Hyd Pressure1       13 2560  11   12.44   12.43   11.40   -0.80   58.00
## Hyd Pressure2       14 2556  15   20.96   16.39   28.60    0.00   59.40
## Hyd Pressure3       15 2556  15   20.46   15.98   27.60   -1.20   50.00
## Hyd Pressure4       16 2541  30   96.29   13.12   96.00   52.00  142.00
## Filler Level        17 2551  20  109.25   15.70  118.40   55.80  161.20
## Filler Speed        18 2514  57 3687.20  770.82 3982.00  998.00 4030.00
## Temperature         19 2557  14   65.97    1.38   65.60   63.60   76.20
## Usage cont          20 2566   5   20.99    2.98   21.79   12.08   25.90
## Carb Flow           21 2569   2 2468.35 1073.70 3028.00   26.00 5104.00
## Density             22 2570   1    1.17    0.38    0.98    0.24    1.92
## MFR                 23 2359 212  704.05   73.90  724.00   31.40  868.60
## Balling             24 2570   1    2.20    0.93    1.65   -0.17    4.01
## Pressure Vacuum     25 2571   0   -5.22    0.57   -5.40   -6.60   -3.60
## PH                  26 2567   4    8.55    0.17    8.54    7.88    9.36
## Oxygen Filler       27 2559  12    0.05    0.05    0.03    0.00    0.40
## Bowl Setpoint       28 2569   2  109.33   15.30  120.00   70.00  140.00
## Pressure Setpoint   29 2559  12   47.62    2.04   46.00   44.00   52.00
## Air Pressurer       30 2571   0  142.83    1.21  142.60  140.80  148.20
## Alch Rel            31 2562   9    6.90    0.51    6.56    5.28    8.62
## Carb Rel            32 2561  10    5.44    0.13    5.40    4.96    6.06
## Balling Lvl         33 2570   1    2.05    0.87    1.48    0.00    3.66
##                    skew
## Brand Code*          NA
## Carb Volume        0.39
## Fill Ounces       -0.02
## PC Volume          0.34
## Carb Pressure      0.18
## Carb Temp          0.25
## PSC                0.85
## PSC Fill           0.93
## PSC CO2            1.73
## Mnf Flow           0.00
## Carb Pressure1     0.05
## Fill Pressure      0.55
## Hyd Pressure1      0.78
## Hyd Pressure2     -0.30
## Hyd Pressure3     -0.32
## Hyd Pressure4      0.55
## Filler Level      -0.85
## Filler Speed      -2.87
## Temperature        2.39
## Usage cont        -0.54
## Carb Flow         -0.99
## Density            0.53
## MFR               -5.09
## Balling            0.59
## Pressure Vacuum    0.53
## PH                -0.29
## Oxygen Filler      2.66
## Bowl Setpoint     -0.97
## Pressure Setpoint  0.20
## Air Pressurer      2.25
## Alch Rel           0.88
## Carb Rel           0.50
## Balling Lvl        0.59

Categorical variable analysis

From the above table, it is noted that the first row does not produce meaningful statistics; that is due to the presence of categorical data (Brand Code). Let’s take a look and see what information it provides.

# Number of records for each brand code.
brand.code.describe <- table(train.df[,"Brand Code"])
brand.code.describe
## 
##    A    B    C    D 
##  293 1239  304  615

Let’s create a visualization of the above results.

# Categorical variable visualizations.
ggplot(train.df, aes(x = `Brand Code`)) +
        geom_bar() +
        xlab("Brand Code") +
        ggtitle("Categorical variable analysis", subtitle = "Training data-set") +    
        theme(axis.text.x = element_text(angle = 45, hjust = 1),
             panel.background = element_rect(fill = "white", colour = "grey50"))

NA’s and missing values analysis

Let’s have an in-depth evaluation of the missing values in the data sets.

# Procedure to visualize missing data in the training data-set.
vis_miss(train.df)

# Procedure to visualize missing data in the test data-set.
vis_miss(eval.df)

As the above graphs show, there are missing values across diverse predictors. An approach will be evaluated to determine how to deal with them; candidate alternatives are random value generation, mean replacement, and k-nearest neighbors.
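
For reference, the last two alternatives could be sketched as follows (not run here; the approach actually chosen is developed in the Pre-process section):

# Sketch (not run): candidate imputation approaches.
# Mean replacement for a single predictor:
# train.df$`PC Volume`[is.na(train.df$`PC Volume`)] <-
#     mean(train.df$`PC Volume`, na.rm = TRUE)
# k-nearest neighbors via caret (centers and scales internally):
# knn.trans <- preProcess(as.data.frame(train.df[,-1]), method = "knnImpute")
# train.df.knn <- predict(knn.trans, as.data.frame(train.df[,-1]))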

Correlations

Let’s find some valuable information related to correlated data.

# Procedure to visualize correlations from the training data-set.
vis_cor(train.df[,-1])

From the above heat map, we notice that some predictors are highly correlated with one another. Let’s look at the table below for a better numerical understanding.

# Finding correlation values.
train.df.correlations <- cor(train.df[,-1], use = "na.or.complete")
train.df.highlyCorrelated <- findCorrelation(train.df.correlations, .9,
                                            names = TRUE)
train.df.highlyCorrelated
## [1] "Balling"       "Hyd Pressure3" "Alch Rel"      "Balling Lvl"  
## [5] "Filler Level"  "Filler Speed"

The table below shows how the remaining predictors correlate with these flagged variables.

# Highly correlated predictors table.
train.df.correlations[, train.df.highlyCorrelated]
##                        Balling Hyd Pressure3      Alch Rel  Balling Lvl
## Carb Volume        0.813591663   0.059336000  0.8158331166  0.813237925
## Fill Ounces       -0.071941481  -0.101515495 -0.1136586129 -0.069129410
## PC Volume         -0.206627558  -0.013651918 -0.1827826250 -0.216460880
## Carb Pressure      0.455251018   0.005292565  0.4599844691  0.459400765
## Carb Temp          0.012302398  -0.021060666  0.0126474400  0.015336902
## PSC               -0.064304807  -0.018638895 -0.0577414576 -0.063795760
## PSC Fill          -0.016029764  -0.070111342 -0.0165666851 -0.013751597
## PSC CO2           -0.037908529   0.005570008 -0.0472198255 -0.041621468
## Mnf Flow           0.111756786   0.766609456  0.0266134859  0.040614078
## Carb Pressure1     0.069514100   0.372447094  0.0083014993  0.028748873
## Fill Pressure     -0.153454574   0.438994867 -0.1995758740 -0.177579909
## Hyd Pressure1      0.028843862   0.606055749  0.0072204836 -0.005205960
## Hyd Pressure2      0.107994707   0.917601008  0.0389957832  0.030680288
## Hyd Pressure3      0.125081415   1.000000000  0.0468596443  0.038754049
## Hyd Pressure4     -0.564045701   0.045841797 -0.6410271129 -0.540084684
## Filler Level       0.005039554  -0.519627534  0.0376490394  0.044237022
## Filler Speed       0.040607335   0.025225701 -0.0013221398  0.007329327
## Temperature       -0.251482907  -0.101933891 -0.2596315732 -0.244531477
## Usage cont         0.075267723   0.418701265 -0.0216651650  0.028476778
## Carb Flow         -0.106306866  -0.359026807  0.0008384732 -0.059512299
## Density            0.952312453   0.040016264  0.9157797557  0.955090031
## MFR                0.053828644   0.005859363  0.0028989305  0.019651642
## Balling            1.000000000   0.125081415  0.9412251195  0.987672698
## Pressure Vacuum   -0.160067789  -0.670471649 -0.0457739630 -0.040975418
## PH                 0.076700227  -0.268101792  0.1666822277  0.109371168
## Oxygen Filler     -0.113553207  -0.361595043 -0.0472027504 -0.074252627
## Bowl Setpoint      0.019789436  -0.521523402  0.0566897793  0.060590541
## Pressure Setpoint -0.203360219   0.465186779 -0.2590705556 -0.237481474
## Air Pressurer     -0.097453982  -0.060687339 -0.0785445679 -0.080362175
## Alch Rel           0.941225120   0.046859644  1.0000000000  0.942980106
## Carb Rel           0.854258198   0.015161342  0.8768039435  0.868287153
## Balling Lvl        0.987672698   0.038754049  0.9429801060  1.000000000
##                   Filler Level Filler Speed
## Carb Volume       -0.035059790  0.005145162
## Fill Ounces        0.005731206  0.046057991
## PC Volume          0.244805472 -0.071446523
## Carb Pressure     -0.005891435  0.029834711
## Carb Temp          0.004407884  0.025934338
## PSC               -0.007903061 -0.002305653
## PSC Fill           0.049625985 -0.018233323
## PSC CO2           -0.042996111 -0.004757648
## Mnf Flow          -0.601675814  0.019642903
## Carb Pressure1    -0.421943452  0.015465888
## Fill Pressure     -0.453216647 -0.220324539
## Hyd Pressure1     -0.026973561 -0.025814734
## Hyd Pressure2     -0.412597662  0.066784872
## Hyd Pressure3     -0.519627534  0.025225701
## Hyd Pressure4     -0.102826457 -0.253878216
## Filler Level       1.000000000  0.057029484
## Filler Speed       0.057029484  1.000000000
## Temperature        0.046355068 -0.069976280
## Usage cont        -0.365969318  0.044654810
## Carb Flow          0.024240046 -0.063559934
## Density            0.001856279  0.027219003
## MFR                0.064052585  0.951422388
## Balling            0.005039554  0.040607335
## Pressure Vacuum    0.349663826  0.052296859
## PH                 0.352043962 -0.040882953
## Oxygen Filler      0.224897341 -0.040001872
## Bowl Setpoint      0.977381056  0.022198204
## Pressure Setpoint -0.426970498 -0.068225074
## Air Pressurer     -0.124631532 -0.005162739
## Alch Rel           0.037649039 -0.001322140
## Carb Rel           0.117920114  0.005468427
## Balling Lvl        0.044237022  0.007329327

Let’s find the predictors that are highly correlated with each of these variables.

1. Balling

Let’s identify the highly correlated predictors to Balling.

# 'Balling' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Balling']>0.9 
                           &  train.df.correlations[,'Balling']<1))
## [1] "Density"     "Alch Rel"    "Balling Lvl"

From the above list, it is determined that ‘Density’, ‘Alch Rel’ and ‘Balling Lvl’ are highly correlated to ‘Balling’.

2. Hyd Pressure3

Let’s identify the predictors highly correlated with Hyd Pressure3.

# 'Hyd Pressure3' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Hyd Pressure3']>0.9 
                           &  train.df.correlations[,'Hyd Pressure3']<1))
## [1] "Hyd Pressure2"

From the above list, it is determined that ‘Hyd Pressure2’ is highly correlated to ‘Hyd Pressure3’.

3. Alch Rel

Let’s identify the highly correlated predictors to Alch Rel.

# 'Alch Rel' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Alch Rel']>0.9 
                           &  train.df.correlations[,'Alch Rel']<1))
## [1] "Density"     "Balling"     "Balling Lvl"

From the above list, it is determined that ‘Density’, ‘Balling’ and ‘Balling Lvl’ are highly correlated to ‘Alch Rel’.

4. Balling Lvl

Let’s identify the highly correlated predictors to Balling Lvl.

# 'Balling Lvl' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Balling Lvl']>0.9 
                           &  train.df.correlations[,'Balling Lvl']<1))
## [1] "Density"  "Balling"  "Alch Rel"

From the above list, it is determined that ‘Density’, ‘Balling’ and ‘Alch Rel’ are highly correlated to ‘Balling Lvl’.

5. Filler Level

Let’s identify the highly correlated predictors to Filler Level.

# 'Filler Level' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Filler Level']>0.9 
                           &  train.df.correlations[,'Filler Level']<1))
## [1] "Bowl Setpoint"

From the above list, it is determined that ‘Bowl Setpoint’ is highly correlated to ‘Filler Level’.

6. Filler Speed

Let’s identify the highly correlated predictors to Filler Speed.

# 'Filler Speed' highly correlated predictors.
row.names(subset(train.df.correlations, 
                           train.df.correlations[,'Filler Speed']>0.9 
                           &  train.df.correlations[,'Filler Speed']<1))
## [1] "MFR"

From the above list, it is determined that ‘MFR’ is highly correlated to ‘Filler Speed’.
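
The six checks above all follow the same pattern, so the same result can also be produced in a single pass (an equivalent, more compact form):

# Compact equivalent of the six blocks above: for each flagged
# predictor, list the variables correlated with it above 0.9.
sapply(train.df.highlyCorrelated, function(p) {
    idx <- train.df.correlations[, p] > 0.9 & train.df.correlations[, p] < 1
    row.names(train.df.correlations)[idx]
})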

Near Zero Variance

In this section, we will focus on the near-zero variance diagnostic for all predictors. That is, we will flag predictors that have one unique value (i.e., zero-variance predictors) or that have both of the following characteristics: very few unique values relative to the number of samples, and a large ratio of the frequency of the most common value to the frequency of the second most common value.

# Procedure to find near zero variance.
train.df.ZeroVar <- nearZeroVar(train.df, names = TRUE)
train.df.ZeroVar
## [1] "Hyd Pressure1"

Let’s visualize this predictor in order to get a better idea of its composition.

# 'Hyd Pressure1' Frequency Histogram.
zeroVar.df <- data.frame(train.df[,"PH"],
                         train.df[,"Hyd Pressure1"])
zeroVar.df <- zeroVar.df[complete.cases(zeroVar.df),]

hist(zeroVar.df[,"Hyd.Pressure1"],
    main = 'Hyd Pressure1 Frequency',
    xlab = 'Hyd Pressure1',
    col = 2)

From the above analysis, ‘Hyd Pressure1’ is considered a near-zero variance predictor; hence, it will not be included in the modeling process, since it does not carry enough information to improve our models.

Extra exploratory analysis

Let’s do some extra exploratory analysis in order to get the most information available from our data set.

Let’s take a look at the following charts in order to get some insights.

# Data visualization procedure
plot_num(train.df, bins = 10)

From the above charts, we notice some interesting patterns. Some predictors seem to follow a near-normal distribution and appear continuous in their measurements; others present values far from the main cluster of observations; and others seem to record categorical measurements as numerical values, ‘Pressure Setpoint’ for example.

Let’s analyze some key predictors whose observed values look particularly interesting.

Mnf Flow

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Mnf Flow' patterns.
temp.df <- data.frame(train.df[,"PH"],
                      train.df[,"Mnf Flow"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Mnf.Flow"],
    main = "'PH' levels based on 'Mnf Flow'",
    xlab = "Mnf Flow",
    ylab = "PH",
    col = 4)

In this particular case, it is very interesting to note the negative values; there is no apparent reason why they appear. A possible explanation is that NA values were imputed as -100. One option is to replace these -100 values with randomly selected values and re-run a near-zero variance check; depending on the results, we could then discard the predictor or find a different approach.
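
That sentinel-handling option could be sketched as follows (not run; shown only for illustration):

# Sketch (not run): treat the -100 readings as missing before
# re-checking near-zero variance.
# mnf <- train.df$`Mnf Flow`
# mnf[mnf <= -100] <- NA
# nearZeroVar(data.frame(mnf), saveMetrics = TRUE)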

Let’s have a better understanding for these negative values.

# Procedure to find extra information about 'Mnf Flow'.
temp.df <- subset(temp.df, temp.df[,"Mnf.Flow"] < 0)
desc.df <- round(describe.factor(as.factor(temp.df[,"Mnf.Flow"])),0)
desc.df
##                                 
## as.factor(temp.df[, "Mnf.Flow"]) -100 -100.2
##                          Count    605    578
##                          Percent   51     49

From the above results, we notice two adjacent recorded values, \(-100.2\) and \(-100.0\); together they account for \(1183\) observations, about \(46\%\) of the \(2571\) rows in the data set. This means that almost half of the observations present this “problematic” behavior.

Due to the high number of affected observations and the odd pattern, which does not follow the behavior observed elsewhere, we consider it better not to include this predictor in our modeling process.

Hyd Pressure2

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Hyd Pressure2' patterns.
temp.df <- data.frame(train.df[,"PH"],
                         train.df[,"Hyd Pressure2"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Hyd.Pressure2"],
    main = "'PH' levels based on 'Hyd Pressure2'",
    xlab = "Hyd Pressure2",
    ylab = "PH",
    col = 4)

Let’s perform a similar analysis on this predictor.

# Procedure to find extra information about 'Hyd Pressure2'.
temp.df <- subset(temp.df, temp.df[,"Hyd.Pressure2"] < 1)
desc.df <- round(describe.factor(as.factor(temp.df[,"Hyd.Pressure2"])),0)
desc.df
##                                      
## as.factor(temp.df[, "Hyd.Pressure2"])   0 0.2
##                               Count   776 106
##                               Percent  88  12

From the above results, we notice two adjacent recorded values, \(0.0\) and \(0.2\); together they account for \(882\) observations, about \(34.3\%\) of the \(2571\) rows in the data set. This means that roughly one third of the observations present this “problematic” behavior.

In our view, it is better not to include this predictor; given this odd behavior, it could not accurately explain the PH values for about one third of the observations.

Hyd Pressure3

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Hyd Pressure3' patterns.
temp.df <- data.frame(train.df[,"PH"],
                         train.df[,"Hyd Pressure3"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Hyd.Pressure3"],
    main = "'PH' levels based on 'Hyd Pressure3'",
    xlab = "Hyd Pressure3",
    ylab = "PH",
    col = 4)

Let’s perform a similar analysis on this predictor.

# Procedure to find extra information about 'Hyd Pressure3'.
temp.df <- subset(temp.df, temp.df[,"Hyd.Pressure3"] < 1)
desc.df <- round(describe.factor(as.factor(temp.df[,"Hyd.Pressure3"])),0)
desc.df
##                                      
## as.factor(temp.df[, "Hyd.Pressure3"])   0 -1.2
##                               Count   812   70
##                               Percent  92    8

From the above results, we notice two clustered low values, \(0.0\) and \(-1.2\); together they account for \(882\) observations, about \(34.3\%\) of the \(2571\) rows in the data set. Again, roughly one third of the observations present this “problematic” behavior.

As before, we consider it better not to include this predictor; given this odd behavior, it could not accurately explain the PH values for about one third of the observations.

Filler Speed

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Filler Speed' patterns.
temp.df <- data.frame(train.df[,"PH"],
                      train.df[,"Filler Speed"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Filler.Speed"],
    main = "'PH' levels based on 'Filler Speed'",
    xlab = "Filler Speed",
    ylab = "PH",
    col = 4)

In this particular case, the predictor seems to follow a “natural” process; hence we will include it in our modeling process.

Carb Flow

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Carb Flow' patterns.
temp.df <- data.frame(train.df[,"PH"],
                      train.df[,"Carb Flow"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Carb.Flow"],
    main = "'PH' levels based on 'Carb Flow'",
    xlab = "Carb Flow",
    ylab = "PH",
    col = 4)

In this particular case, the predictor seems to follow a “natural” process; hence we will include it in our modeling process.

Pressure Setpoint

This predictor seems to have interesting values; let’s develop a better understanding.

# 'Pressure Setpoint' patterns.
temp.df <- data.frame(train.df[,"PH"],
                      train.df[,"Pressure Setpoint"])
temp.df <- temp.df[complete.cases(temp.df),]

# Plotting patterns.
plot(temp.df[,"PH"] ~ temp.df[,"Pressure.Setpoint"],
    main = "'PH' levels based on 'Pressure Setpoint'",
    xlab = "Pressure Setpoint",
    ylab = "PH",
    col = 4)

In this particular case, the predictor seems to follow a “natural” process, even though some values fall outside what could be considered the categorical levels; hence we will include it in our modeling process.

Dimension reduction

From the above analysis, we believe it is “safe” to remove the highly correlated predictors, the near-zero variance predictor, and the “problematic” predictors described above; this avoids splitting explained variance across highly correlated predictors that appear tied to the same stages of the manufacturing process.

# Dimension reduction by dropping highly correlated and problematic variables.
removeVars <- c('Balling', 'Hyd Pressure3', 'Alch Rel', 'Balling Lvl', 'Filler Level', 
                'Filler Speed', 'Hyd Pressure1', 'Mnf Flow', 'Hyd Pressure2')

myvars <- names(train.df) %in% removeVars 

# Dropping highly correlated variables.
train.df.reduced <- train.df[,!myvars]          # Training
eval.df.reduced <- eval.df[,!myvars]            # Evaluation

Let’s have a few final visualizations of our remaining data set.

Missing Values and Predictors

# Procedure to visualize missing data in the reduced training data-set
vis_miss(train.df.reduced)

Final Correlations

# Procedure to visualize correlations in from the reduced training data-set
vis_cor(train.df.reduced[,-1])

After all of the above reductions, we are now left with the following dimensions:

# Dimension of training data set after predictors elimination.
dim(train.df.reduced)
## [1] 2571   24

In this case, we notice that our reduced data set has a total of 23 predictors plus PH, the response variable.

Pre-process

In this section, we will focus on the pre-processing steps needed in order to focus on the modeling processes.

First, we will focus on missing data imputation. For this process we will consider three possible actions, in the following order: a) random value generation, b) mean value imputation, c) k-nearest neighbors (the least likely option).

Second, once the missing values have been imputed, we will standardize our data sets.

In order to pre-process the data, what we need to do is to center and scale as follows:

\[z_{i,j}= \frac{x_{i,j} - \mu_j}{\sigma_j}\]

Where \(i\) indexes the observations and \(j\) the predictor columns (including the response variable PH); \(z_{i,j}\) is the standardized value of observation \(i\) for predictor \(j\), \(x_{i,j}\) is the raw observation, \(\mu_j\) is the mean of predictor \(j\), and \(\sigma_j\) is its standard deviation.

The above process ensures better handling of units, since all predictors become equally representative, avoiding future problems due to the different scales and measurement units involved.
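
For a single predictor, this is the familiar z-score; a minimal illustration (equivalent to what scale(), or caret’s preProcess below, does column by column):

# Manual centering and scaling of one predictor.
x <- train.df$`Carb Volume`
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
round(c(mean(z, na.rm = TRUE), sd(z, na.rm = TRUE)), 2)
## [1] 0 1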

Third, depending on need, we might use transformations; that is, Box-Cox transformations might be required.

A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if our data is not normal, applying a Box-Cox transformation means we will be able to run a broader number of tests.
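
Should that step be needed, it can be added to the same caret pipeline used below; a sketch (not run), noting that caret estimates Box-Cox only for variables with strictly positive values:

# Sketch (not run): Box-Cox estimated alongside centering and scaling,
# once imputation is complete.
# trans.bc <- preProcess(train.df.clean,
#                        method = c("BoxCox", "center", "scale"))
# train.df.bc <- predict(trans.bc, train.df.clean)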

1) Impute missing data

In order to do so, let’s take a look at the “behavior” of the remaining data.

# Data visualization procedure
plot_num(train.df.reduced, bins = 10)

From previous analysis, it was determined that the number of missing observations per predictor is very low. Hence, randomly selected values will be used to replace the missing values; since the total missing data is only about \(1.2\%\), this should not considerably affect the analysis.
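
That figure can be checked directly:

# Overall share (%) of missing cells in the reduced training set.
round(100 * sum(is.na(train.df.reduced)) / prod(dim(train.df.reduced)), 1)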

Brand Code

In this section we discuss how to impute these missing values. One suggested approach is to randomly select brand values, assuming there are only 4 different brands; an alternative would be to create a new brand level from the NA values; a third option would be to use a machine learning model to predict the missing brand. Here we proceed with weighted random selection based on the observed brand proportions.

Let’s take a look at the current composition.

# Procedure to find missing percentage values from 'Brand Code'
desc.df <- round(describe.factor(as.factor(train.df.reduced$`Brand Code`)),0)
desc.df
##                                         
## as.factor(train.df.reduced$`Brand Code`)    B   D   C   A <NA>
##                                  Count   1239 615 304 293  120
##                                  Percent   48  24  12  11    5

To make our imputation as faithful as possible, we will use the percentages from the above table as sampling probabilities when generating the random brands.

# Procedure to replace missing values in 'Brand Code'.
BC_naReplace <- function(df){
    brand <- c('A','B','C','D')
    isNA <- is.na(df$`Brand Code`)
    set.seed(123)
    # Sampling probabilities aligned with the observed brand proportions
    # (A ~ 11%, B ~ 48%, C ~ 12%, D ~ 24%).
    tempSamp <- sample(brand, size=sum(isNA), replace=TRUE, prob=c(0.11, 0.48, 0.12, 0.24))
    df$`Brand Code`[isNA] <- tempSamp
    return(df)
}

# Filling NA values from training data frame
train.df.clean <- BC_naReplace(train.df.reduced)

# Filling NA values from evaluation data frame
eval.df.clean <- BC_naReplace(eval.df.reduced)

# Calculating final outcome after replacement
desc.df <- round(describe.factor(as.factor(train.df.clean$`Brand Code`)),0)
desc.df
##                                       
## as.factor(train.df.clean$`Brand Code`)    B   D   A   C
##                                Count   1264 635 355 317
##                                Percent   49  25  14  12

Note that the brand percentages did not change much after imputation.
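
For completeness, the model-based alternative mentioned earlier could look like the following sketch (not run; the multinomial model from nnet is used purely as an illustration):

# Sketch (not run): predict the missing brand from the numeric predictors.
# known   <- subset(train.df.reduced, !is.na(`Brand Code`))
# unknown <- subset(train.df.reduced,  is.na(`Brand Code`))
# brand.fit <- nnet::multinom(factor(`Brand Code`) ~ ., data = known)
# predict(brand.fit, newdata = unknown)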

Pressure Setpoint

Let’s do a similar replacement for this predictor; for that, we will treat the readings as “categorical”.

# Procedure to find missing percentage values from 'Pressure Setpoint'
desc.df <- round(describe.factor(as.factor(train.df.clean$`Pressure Setpoint`)),0.5)
desc.df
##                                              
## as.factor(train.df.clean$`Pressure Setpoint`)     46   50    48   44 <NA>
##                                       Count   1322.0 1002 125.0 96.0 12.0
##                                       Percent   51.4   39   4.9  3.7  0.5
##                                              
## as.factor(train.df.clean$`Pressure Setpoint`)   52 46.4 46.6 46.8
##                                       Count   11.0    1    1    1
##                                       Percent  0.4    0    0    0

From the above, we notice the need not only to replace the missing values but also to map the recorded decimal readings to the “proper categorical” value. This is considered acceptable due to the low number of such observations.

# Procedure to replace missing values from 'Pressure Setpoint'
PS_naReplace <- function(df){
    set.seed(123)
    pressure <- c('44','46','48','50','52')
    isNA <- is.na(df$`Pressure Setpoint`)
    tempSamp <- sample(pressure, size=sum(isNA), replace=TRUE, prob=c(0.0385, 0.516, 0.0495, 0.3915, 0.0045) )
    df$`Pressure Setpoint`[isNA] <- tempSamp
    return(df)
}

# Filling NA values from training data frame
train.df.clean <- PS_naReplace(train.df.clean)

# Filling NA values from evaluation data frame
eval.df.clean <- PS_naReplace(eval.df.clean)

# Flooring values (e.g. 46.4, 46.6, 46.8 -> 46) since there is no 47 level
train.df.clean$`Pressure Setpoint` <- floor(as.numeric(train.df.clean$`Pressure Setpoint`))
eval.df.clean$`Pressure Setpoint` <- floor(as.numeric(eval.df.clean$`Pressure Setpoint`))

# Calculating final outcome after replacement
desc.df <- round(describe.factor(as.factor(train.df.clean$`Pressure Setpoint`)),1)
desc.df
##                                              
## as.factor(train.df.clean$`Pressure Setpoint`)     46     50    48   44
##                                       Count   1330.0 1007.0 127.0 96.0
##                                       Percent   51.7   39.2   4.9  3.7
##                                              
## as.factor(train.df.clean$`Pressure Setpoint`)   52
##                                       Count   11.0
##                                       Percent  0.4

Remaining predictors

For the remaining predictors, we will impute missing values with random draws between the observed minimum and maximum of each predictor.

# Procedure to replace NA values in the remaining predictors with random
# draws between each predictor's observed minimum and maximum.
P_naReplace <- function(df){  
    cPred <- dim(df)[2]
    for (i in 2:cPred){
        isNA <- is.na(df[,i])
        minP <- min(df[,i], na.rm = TRUE)
        maxP <- max(df[,i], na.rm = TRUE)
        set.seed(123)
        tempSamp <- round(runif(sum(isNA), min=minP, max=maxP),2)
        df[isNA,i] <- tempSamp
    }
    return(df)  
}


# Filling NA values from training data frame
train.df.clean <- P_naReplace(train.df.clean)

# Filling NA values from evaluation data frame
# (column 19 of the reduced set is PH, which must remain NA for evaluation)
eval.df.clean <- P_naReplace(eval.df.clean[,-19])
eval.df.clean$PH <- NA

# Sorting Columns procedure
# List all columns from the reduced data-set in order to sort and place 'PH' first.
selCols <- c('PH', 'Brand Code', 'Carb Volume', 'Fill Ounces', 'PC Volume', 'Carb Pressure', 'Carb Temp',
             'PSC', 'PSC Fill', 'PSC CO2', 'Carb Pressure1', 'Fill Pressure', 'Hyd Pressure4', 'Temperature',
             'Usage cont', 'Carb Flow', 'Density', 'MFR', 'Pressure Vacuum', 'Oxygen Filler', 'Bowl Setpoint',
             'Pressure Setpoint', 'Air Pressurer', 'Carb Rel')

train.df.clean <- dplyr::select(train.df.clean, dplyr::all_of(selCols))
eval.df.clean <- dplyr::select(eval.df.clean, dplyr::all_of(selCols))

2) Preprocessing

In this section, we will center and scale the data using an automated pre-processing algorithm from the caret package, as follows.

# Procedure to Pre-Process

# Need to find mu and sigma to revert the preprocess in terms of centered and scaled.
ph.mu <- mean(train.df.clean$PH)
ph.sigma <- sd(train.df.clean$PH)

# Pre-process function
trans <- preProcess(train.df.clean, method = c("center", "scale"))
trans
## Created from 2571 samples and 24 variables
## 
## Pre-processing:
##   - centered (23)
##   - ignored (1)
##   - scaled (23)

From the above results, we see that the 23 numeric variables were centered and scaled automatically, while the one categorical variable (Brand Code) was ignored.

# Procedure to re-calculate new values based on the pre-process
train.df.transformed <- predict(trans, train.df.clean)
eval.df.transformed <- predict(trans, eval.df.clean)
# Data visualization procedure
plot_num(train.df.transformed, bins = 10)

From the above graphs, we notice an improvement in the distributions. We might be able to further improve normality with a Box-Cox transformation if needed.

Advanced Analysis

Now that we have the final data frames, we are ready to explore further via unsupervised and supervised machine learning.

A few methods, including K-means clustering and Principal Component Analysis, will be discussed here, as the goal of the project is to understand our manufacturing process and its predictive factors, and to report a predictive model of PH.

This type of analysis may provide further information about how similar or dissimilar the beverages are to each other, as well as information about the specific processes involved.

K-means Clustering Analysis

K-means clustering attempts to find an optimal number of clusters (groups with similar characteristics) within the data set. Presumably, with four different beverages, there should be four different centroids. However, this assumption may be incorrect, as one or more beverages may be more similar to each other; for instance, a fruit juice may be similar to a vegetable juice, and thus their manufacturing processes may be nearly identical.

K-means clustering allows the data scientist to identify the number of clusters that exist. This is performed using the elbow method, based on the within-cluster sum of squares (WCSS).

We can attempt to cluster the beverages that exist within this data set. Note that this type of analysis uses distances between observations, so the data must be centered, scaled, and continuous. By removing Brand Code, the only categorical variable, the data set satisfies all requirements and can be processed for clustering.

# K-means Clustering Analysis.
# Using train.df.transformed
train_df_trans <- subset(train.df.transformed, select = -`Brand Code`)
set.seed(123)
wcss = vector()
for (i in 1:10) {wcss[i] = sum(kmeans(train_df_trans, i)$withinss)}
plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')

The largest bend in the visualization above appears at 3 clusters. This is interesting given that there are four different types of beverages; however, as mentioned above, two of the beverages may share more characteristics than previously assumed. Let us evaluate the 3 clusters in our data set and investigate their granular details.

# K-means Clustering Analysis - Granular details.
set.seed(123)
km.fit <- kmeans(x = train_df_trans, centers = 3)    # avoid shadowing kmeans()
y_kmeans <- km.fit$cluster
t(km.fit$centers)
##                              1           2             3
## PH                 0.255127700 -0.63888983  0.3583553909
## Carb Volume        1.180583357 -0.39723508 -0.5850009055
## Fill Ounces       -0.120161745  0.16357929 -0.0484037888
## PC Volume         -0.246580563 -0.19269185  0.3638163886
## Carb Pressure      0.651040121 -0.25186724 -0.2938431017
## Carb Temp          0.042440846 -0.05144381  0.0115457423
## PSC               -0.123748376  0.06843128  0.0378338482
## PSC Fill          -0.004389614 -0.02173976  0.0225259170
## PSC CO2           -0.093875059  0.08416548  0.0004283861
## Carb Pressure1     0.071787717  0.47476974 -0.4729073314
## Fill Pressure     -0.281033078  0.77977445 -0.4613715844
## Hyd Pressure4     -0.808996772  0.43516913  0.2580265812
## Temperature       -0.315279771  0.05274508  0.2029809016
## Usage cont         0.024994023  0.60215820 -0.5475820768
## Carb Flow         -0.051853450 -0.31977779  0.3212910369
## Density            1.323020070 -0.52170936 -0.5884822434
## MFR                0.054997328 -0.07352428  0.0209751598
## Pressure Vacuum   -0.097919269 -0.50409237  0.5192661147
## Oxygen Filler     -0.096802623 -0.40327289  0.4300095308
## Bowl Setpoint      0.119309382 -0.89229506  0.6878377342
## Pressure Setpoint -0.416971819  0.96247140 -0.5140620437
## Air Pressurer     -0.125724172  0.11178398  0.0013945614
## Carb Rel           1.249054357 -0.60610638 -0.4560368575

As the numbers demonstrate, each cluster center is the mean of its cluster in every standardized variable, and thus occupies a certain region of the n-dimensional space of the data set. Judging from the numbers alone, the differences can be difficult to discern, so visualization may assist us in determining them.

Before going further, histograms of the data grouped by brand will be plotted to help visualize the differences.

# K-means Clustering Analysis - Group Analysis.
A <- train_df_trans[train.df.transformed$`Brand Code` == 'A',]
B <- train_df_trans[train.df.transformed$`Brand Code` == 'B',]
C <- train_df_trans[train.df.transformed$`Brand Code` == 'C',]
D <- train_df_trans[train.df.transformed$`Brand Code` == 'D',]

Let’s have a data visualization and description for each Brand.

Brand Code = “A”

Let’s have a data description and data visualization for Brand Code = “A”.

# Data visualization and description for Brand Code = "A".
plot_num(A, bins = 10)

describe.by(A)
##                   vars   n  mean   sd median trimmed  mad   min  max range
## PH                   1 355 -0.29 0.95  -0.15   -0.27 0.86 -2.81 1.93  4.73
## Carb Volume          2 355  0.30 0.88   0.46    0.36 0.74 -3.09 2.40  5.49
## Fill Ounces          3 355  0.11 1.03   0.05    0.09 0.87 -3.49 3.73  7.22
## PC Volume            4 355 -0.07 1.13  -0.25   -0.11 1.07 -3.20 2.94  6.14
## Carb Pressure        5 355  0.13 0.96   0.22    0.15 0.91 -2.79 2.78  5.58
## Carb Temp            6 355 -0.01 0.99  -0.03   -0.02 0.94 -2.96 3.11  6.07
## PSC                  7 355 -0.11 0.96  -0.27   -0.20 0.82 -1.58 3.65  5.23
## PSC Fill             8 355  0.00 0.99  -0.14   -0.10 0.99 -1.65 3.21  4.85
## PSC CO2              9 355 -0.02 0.98  -0.39   -0.16 0.66 -1.29 4.08  5.37
## Carb Pressure1      10 355  0.08 0.96   0.16    0.08 0.79 -2.39 2.78  5.17
## Fill Pressure       11 355  0.13 1.05  -0.04    0.05 0.91 -3.61 3.84  7.45
## Hyd Pressure4       12 355  0.36 1.15   0.27    0.31 1.24 -2.63 3.41  6.04
## Temperature         13 355  0.22 1.15  -0.14    0.06 0.82 -1.24 7.03  8.26
## Usage cont          14 355  0.02 1.02   0.34    0.11 1.00 -2.78 1.12  3.89
## Carb Flow           15 355 -0.04 1.08   0.52    0.09 0.35 -2.27 2.45  4.72
## Density             16 355  0.78 0.70   1.08    0.88 0.31 -1.52 1.55  3.07
## MFR                 17 355 -0.03 1.06   0.32    0.22 0.16 -5.49 1.56  7.05
## Pressure Vacuum     18 355 -0.11 0.94  -0.32   -0.18 1.04 -2.43 2.48  4.91
## Oxygen Filler       19 355 -0.08 1.03  -0.34   -0.28 0.44 -0.93 7.19  8.12
## Bowl Setpoint       20 355  0.02 1.05   0.70    0.08 0.97 -2.57 2.00  4.57
## Pressure Setpoint   21 355  0.08 1.04  -0.79    0.10 1.45 -1.77 2.15  3.92
## Air Pressurer       22 355 -0.07 0.72  -0.19   -0.16 0.49 -1.51 3.44  4.95
## Carb Rel            23 355  0.40 0.85   0.48    0.43 0.68 -2.59 4.78  7.37
##                    skew kurtosis   se
## PH                -0.27    -0.37 0.05
## Carb Volume       -0.73     0.97 0.05
## Fill Ounces        0.31     1.07 0.05
## PC Volume          0.35     0.02 0.06
## Carb Pressure     -0.19     0.16 0.05
## Carb Temp          0.07     0.09 0.05
## PSC                1.00     1.13 0.05
## PSC Fill           0.88     0.42 0.05
## PSC CO2            1.55     3.13 0.05
## Carb Pressure1    -0.01    -0.05 0.05
## Fill Pressure      0.62     1.16 0.06
## Hyd Pressure4      0.29    -0.02 0.06
## Temperature        2.31     8.01 0.06
## Usage cont        -0.60    -0.98 0.05
## Carb Flow         -0.84    -0.75 0.06
## Density           -1.21     0.16 0.04
## MFR               -2.78     7.55 0.06
## Pressure Vacuum    0.59     0.24 0.05
## Oxygen Filler      3.57    17.14 0.05
## Bowl Setpoint     -0.70    -0.42 0.06
## Pressure Setpoint  0.04    -1.62 0.06
## Air Pressurer      2.47     8.38 0.04
## Carb Rel           0.27     3.84 0.05

Brand Code = “B”

Let’s have a data description and data visualization for Brand Code = “B”.

# Data visualization and description for Brand Code = "B".
plot_num(B, bins = 10)

describe.by(B)
##                   vars    n  mean   sd median trimmed  mad   min  max
## PH                   1 1264  0.11 0.99   0.08    0.14 1.20 -3.15 3.72
## Carb Volume          2 1264 -0.54 0.54  -0.53   -0.54 0.46 -3.09 2.71
## Fill Ounces          3 1264  0.04 1.00   0.05    0.05 0.98 -3.78 3.80
## PC Volume            4 1264  0.08 0.94   0.00    0.04 0.78 -3.02 3.23
## Carb Pressure        5 1264 -0.27 0.92  -0.34   -0.31 0.83 -3.13 3.12
## Carb Temp            6 1264  0.01 1.03  -0.08   -0.03 1.01 -3.06 3.15
## PSC                  7 1264  0.04 1.00  -0.11   -0.05 0.94 -1.62 3.61
## PSC Fill             8 1264 -0.01 0.99  -0.14   -0.11 0.99 -1.65 3.54
## PSC CO2              9 1264  0.02 1.03  -0.39   -0.17 0.66 -1.29 4.08
## Carb Pressure1      10 1264 -0.02 1.00   0.08   -0.04 0.98 -3.22 3.45
## Fill Pressure       11 1264  0.07 0.95  -0.20    0.01 1.05 -3.77 3.47
## Hyd Pressure4       12 1264  0.27 0.70   0.12    0.17 0.44 -3.32 3.12
## Temperature         13 1264 -0.04 0.87  -0.27   -0.18 0.41 -1.37 6.75
## Usage cont          14 1264 -0.01 0.99   0.24    0.08 1.09 -2.99 1.65
## Carb Flow           15 1264  0.06 0.97   0.53    0.19 0.29 -2.27 1.27
## Density             16 1264 -0.70 0.31  -0.67   -0.67 0.16 -2.47 1.77
## MFR                 17 1264  0.02 0.97   0.34    0.27 0.14 -5.49 1.52
## Pressure Vacuum     18 1264  0.11 1.03   0.03    0.06 1.04 -2.43 2.84
## Oxygen Filler       19 1264  0.02 1.03  -0.28   -0.16 0.50 -0.93 7.19
## Bowl Setpoint       20 1264 -0.10 1.05   0.70    0.06 0.00 -2.57 2.00
## Pressure Setpoint   21 1264  0.16 0.99   0.19    0.16 1.45 -1.77 2.15
## Air Pressurer       22 1264  0.10 1.14  -0.19   -0.14 0.49 -1.68 4.43
## Carb Rel            23 1264 -0.57 0.51  -0.59   -0.58 0.46 -3.20 3.78
##                   range  skew kurtosis   se
## PH                 6.87 -0.21    -0.21 0.03
## Carb Volume        5.80  0.27     3.06 0.02
## Fill Ounces        7.59 -0.10     1.02 0.03
## PC Volume          6.26  0.33     0.90 0.03
## Carb Pressure      6.25  0.48     0.72 0.03
## Carb Temp          6.20  0.34     0.24 0.03
## PSC                5.23  0.83     0.59 0.03
## PSC Fill           5.19  0.97     0.89 0.03
## PSC CO2            5.37  1.73     3.25 0.03
## Carb Pressure1     6.66  0.17     0.40 0.03
## Fill Pressure      7.24  0.37     1.20 0.03
## Hyd Pressure4      6.44  1.57     4.96 0.02
## Temperature        8.13  3.36    17.08 0.02
## Usage cont         4.64 -0.51    -1.04 0.03
## Carb Flow          3.55 -1.12    -0.29 0.03
## Density            4.24  0.10    15.32 0.01
## MFR                7.01 -3.50    12.17 0.03
## Pressure Vacuum    5.26  0.43    -0.20 0.03
## Oxygen Filler      8.12  2.95    12.85 0.03
## Bowl Setpoint      4.57 -0.90    -0.41 0.03
## Pressure Setpoint  3.92  0.01    -1.82 0.03
## Air Pressurer      6.11  1.91     2.78 0.03
## Carb Rel           6.99  1.40    11.16 0.01

Brand Code = “C”

Let’s have a data description and data visualization for Brand Code = “C”.

# Data visualization and description for Brand Code = "C".
plot_num(C, bins = 10)

describe.by(C)
##                   vars   n  mean   sd median trimmed  mad   min  max range
## PH                   1 317 -0.75 1.01  -0.61   -0.72 1.03 -3.85 4.70  8.54
## Carb Volume          2 317 -0.66 0.65  -0.66   -0.67 0.56 -3.03 2.34  5.37
## Fill Ounces          3 317  0.12 0.99   0.05    0.12 0.87 -3.12 3.73  6.85
## PC Volume            4 317  0.24 0.97   0.20    0.21 0.84 -3.02 3.26  6.29
## Carb Pressure        5 317 -0.36 0.92  -0.39   -0.39 0.91 -2.40 2.85  5.25
## Carb Temp            6 317 -0.03 1.00  -0.03   -0.04 1.01 -2.80 3.05  5.85
## PSC                  7 317  0.11 1.10  -0.07    0.01 1.00 -1.66 3.65  5.31
## PSC Fill             8 317  0.10 1.08  -0.14   -0.03 0.99 -1.48 3.37  4.85
## PSC CO2              9 317  0.13 1.10  -0.39   -0.06 0.66 -1.29 4.08  5.37
## Carb Pressure1      10 317 -0.11 0.95   0.08   -0.11 0.98 -2.68 3.24  5.92
## Fill Pressure       11 317  0.13 1.17   0.02    0.08 0.91 -4.11 3.59  7.70
## Hyd Pressure4       12 317  0.42 0.97   0.27    0.39 0.89 -1.67 3.41  5.09
## Temperature         13 317  0.54 1.08   0.42    0.42 0.61 -1.37 6.89  8.26
## Usage cont          14 317  0.01 1.02   0.32    0.10 1.02 -2.72 1.12  3.85
## Carb Flow           15 317 -0.12 1.03   0.49   -0.02 0.38 -2.27 1.18  3.45
## Density             16 317 -0.66 0.43  -0.57   -0.62 0.24 -2.47 1.55  4.03
## MFR                 17 317 -0.03 1.02   0.30    0.23 0.15 -5.17 1.40  6.57
## Pressure Vacuum     18 317 -0.17 0.94  -0.32   -0.22 1.04 -1.73 2.48  4.21
## Oxygen Filler       19 317  0.09 0.97  -0.27   -0.05 0.71 -0.92 3.89  4.81
## Bowl Setpoint       20 317  0.17 0.82   0.70    0.31 0.00 -2.57 1.35  3.92
## Pressure Setpoint   21 317  0.21 1.01   1.17    0.27 0.00 -1.77 1.17  2.94
## Air Pressurer       22 317 -0.06 0.98  -0.36   -0.28 0.49 -1.35 3.60  4.95
## Carb Rel            23 317 -0.65 0.66  -0.75   -0.66 0.46 -3.28 4.24  7.52
##                    skew kurtosis   se
## PH                 0.13     2.21 0.06
## Carb Volume        0.15     1.79 0.04
## Fill Ounces        0.08     1.65 0.06
## PC Volume          0.05     1.34 0.05
## Carb Pressure      0.39     0.30 0.05
## Carb Temp          0.02    -0.03 0.06
## PSC                0.85     0.46 0.06
## PSC Fill           0.93     0.47 0.06
## PSC CO2            1.66     2.65 0.06
## Carb Pressure1     0.02    -0.07 0.05
## Fill Pressure      0.26     0.93 0.07
## Hyd Pressure4      0.42     0.25 0.05
## Temperature        2.54    10.37 0.06
## Usage cont        -0.56    -1.02 0.06
## Carb Flow         -0.69    -1.15 0.06
## Density           -0.29     4.89 0.02
## MFR               -3.19    10.47 0.06
## Pressure Vacuum    0.49    -0.04 0.05
## Oxygen Filler      1.29     1.26 0.05
## Bowl Setpoint     -1.32     0.91 0.05
## Pressure Setpoint -0.26    -1.63 0.06
## Air Pressurer      2.43     5.32 0.06
## Carb Rel           0.93    10.76 0.04

Brand Code = “D”

Let’s have a data description and data visualization for Brand Code = “D”.

# Data visualization and description for Brand Code = "D".
plot_num(D, bins = 10)

describe.by(D)
##                   vars   n  mean   sd median trimmed  mad   min  max range
## PH                   1 635  0.31 0.80   0.31    0.32 0.86 -2.46 2.16  4.62
## Carb Volume          2 635  1.23 0.69   1.28    1.28 0.46 -2.81 3.08  5.90
## Fill Ounces          3 635 -0.20 0.95  -0.17   -0.20 0.87 -3.60 3.80  7.40
## PC Volume            4 635 -0.24 1.01  -0.36   -0.32 0.83 -3.08 2.97  6.05
## Carb Pressure        5 635  0.64 0.89   0.67    0.65 0.83 -2.86 3.12  5.98
## Carb Temp            6 635  0.00 0.94  -0.08   -0.03 0.87 -2.91 3.10  6.01
## PSC                  7 635 -0.07 0.97  -0.19   -0.16 0.94 -1.66 3.54  5.19
## PSC Fill             8 635 -0.03 0.99  -0.14   -0.12 0.99 -1.65 3.37  5.02
## PSC CO2              9 635 -0.08 0.89  -0.39   -0.20 0.66 -1.29 4.08  5.37
## Carb Pressure1      10 635  0.05 1.04   0.20    0.04 0.98 -3.50 3.61  7.12
## Fill Pressure       11 635 -0.28 0.91  -0.53   -0.35 0.27 -3.74 3.72  7.46
## Hyd Pressure4       12 635 -0.95 0.81  -1.23   -1.12 0.22 -3.01 3.26  6.28
## Temperature         13 635 -0.32 0.97  -0.55   -0.44 0.82 -1.65 5.24  6.89
## Usage cont          14 635  0.00 1.01   0.24    0.08 1.11 -2.86 1.61  4.47
## Carb Flow           15 635 -0.04 0.99   0.51    0.09 0.23 -2.27 1.28  3.56
## Density             16 635  1.29 0.55   1.39    1.40 0.24 -2.31 1.98  4.29
## MFR                 17 635 -0.01 1.02   0.34    0.25 0.16 -5.19 1.46  6.65
## Pressure Vacuum     18 635 -0.07 0.98  -0.32   -0.15 1.04 -2.43 2.84  5.26
## Oxygen Filler       19 635 -0.04 0.92  -0.30   -0.20 0.51 -0.93 6.17  7.10
## Bowl Setpoint       20 635  0.10 0.93   0.70    0.22 0.00 -2.57 2.00  4.57
## Pressure Setpoint   21 635 -0.46 0.83  -0.79   -0.53 0.00 -1.77 2.15  3.92
## Air Pressurer       22 635 -0.12 0.80  -0.19   -0.25 0.49 -1.51 3.77  5.28
## Carb Rel            23 635  1.24 0.66   1.25    1.31 0.46 -3.66 2.78  6.45
##                    skew kurtosis   se
## PH                -0.16    -0.40 0.03
## Carb Volume       -1.41     4.89 0.03
## Fill Ounces        0.03     0.87 0.04
## PC Volume          0.65     1.01 0.04
## Carb Pressure     -0.18     0.87 0.04
## Carb Temp          0.29     0.49 0.04
## PSC                0.91     0.82 0.04
## PSC Fill           0.92     0.81 0.04
## PSC CO2            1.65     4.16 0.04
## Carb Pressure1     0.07     0.43 0.04
## Fill Pressure      0.95     3.54 0.04
## Hyd Pressure4      2.06     4.17 0.03
## Temperature        1.97     6.31 0.04
## Usage cont        -0.53    -1.02 0.04
## Carb Flow         -0.96    -0.68 0.04
## Density           -2.51     7.39 0.02
## MFR               -3.11     9.65 0.04
## Pressure Vacuum    0.66     0.18 0.04
## Oxygen Filler      2.60     9.95 0.04
## Bowl Setpoint     -1.08     0.46 0.04
## Pressure Setpoint  0.99     0.19 0.03
## Air Pressurer      2.73     8.66 0.03
## Carb Rel          -2.20     9.32 0.03

Though the histograms certainly provide a bit more detail about the differences that exist, identifying which processes characterize each brand remains challenging. We can spot some differences; a quick look shows that Carb Volume differs drastically across brands. However, performing the analysis in this format is time consuming and subjective.

Can we calculate the differences and see which medians are the most different from each other?

Let’s take a look at our data to see if we can answer that question; we will find the median of each potential predictor by “Brand Code” from the transformed (centered and scaled) data.

# Procedure to extract the median for each potential predictor by "Brand Code" 
# from the transformed Data (Center and Scaled).
median.ByGroup <- train.df.transformed %>% 
                      group_by(`Brand Code`) %>% 
                      summarise(pHMedian = median(PH),
                                CarbVolumeMedian = median(`Carb Volume`),
                                FillOuncesMedian = median(`Fill Ounces`),
                                PCVolumeMedian = median(`PC Volume`),
                                CarbPressureMedian = median(`Carb Pressure`),
                                CarbTempMedian = median(`Carb Temp`),
                                PSCMedian = median(PSC),
                                PSCFillMedian = median(`PSC Fill`),
                                PSCCO2Median = median(`PSC CO2`),
                                CarbPressure1Median = median(`Carb Pressure1`),
                                FillPressureMedian = median(`Fill Pressure`),
                                HydPressure4Median = median(`Hyd Pressure4`),
                                TemperatureMedian = median(Temperature),
                                UsageContMedian = median(`Usage cont`),
                                CarbFlowMedian = median(`Carb Flow`),
                                DensityMedian = median(Density),
                                MFRMedian = median(MFR),
                                PressureVacuumMedian = median(`Pressure Vacuum`),
                                OxygenFillerMedian = median(`Oxygen Filler`),
                                BowlSetpointMedian = median(`Bowl Setpoint`),
                                PressureSetpointMedian = median(`Pressure Setpoint`),
                                AirPressurerMedian = median(`Air Pressurer`),
                                CarbRelMedian = median(`Carb Rel`))

t(median.ByGroup)
##                               [,1]
## pHMedian               -0.03453844
## CarbVolumeMedian       -0.22208492
## FillOuncesMedian       -0.02472075
## PCVolumeMedian         -0.10077236
## CarbPressureMedian     -0.00354290
## CarbTempMedian         -0.07881764
## PSCMedian              -0.14979968
## PSCFillMedian          -0.14069270
## PSCCO2Median           -0.39486165
## CarbPressure1Median     0.11700803
## FillPressureMedian     -0.47260074
## HydPressure4Median     -0.02792894
## TemperatureMedian      -0.27275354
## UsageContMedian         0.27072472
## CarbFlowMedian          0.52092696
## DensityMedian          -0.51243146
## MFRMedian               0.33103451
## PressureVacuumMedian   -0.32263067
## OxygenFillerMedian     -0.29299645
## BowlSetpointMedian      0.69746218
## PressureSetpointMedian -0.79289662
## AirPressurerMedian     -0.19307804
## CarbRelMedian          -0.28712938

Transposing the table does not seem to provide any obvious additional insight.

An analysis that could provide much better feedback is Principal Component Analysis (PCA). PCA identifies the directions of largest variance in the data and builds principal components from them. Often, these components are linear combinations of variables that are highly correlated with one another. By visualizing the principal components, the differences between brands may begin to emerge. PCA will be employed here to provide further insight.

Principal Component Analysis

Below is the construction of the principal components.

# Principal Component
# Reference: https://www.datacamp.com/community/tutorials/pca-analysis-r#interpret
df.train.pca <- prcomp(train.df.transformed[,-2], center = TRUE, scale. = TRUE)
summary(df.train.pca)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6
## Standard deviation     1.8715 1.7400 1.38190 1.2806 1.16670 1.13377
## Proportion of Variance 0.1523 0.1316 0.08303 0.0713 0.05918 0.05589
## Cumulative Proportion  0.1523 0.2839 0.36695 0.4382 0.49743 0.55332
##                            PC7     PC8    PC9    PC10    PC11    PC12
## Standard deviation     1.10652 1.00990 0.9899 0.96132 0.87946 0.87017
## Proportion of Variance 0.05323 0.04434 0.0426 0.04018 0.03363 0.03292
## Cumulative Proportion  0.60656 0.65090 0.6935 0.73368 0.76731 0.80023
##                           PC13    PC14   PC15    PC16   PC17    PC18
## Standard deviation     0.83845 0.83167 0.7924 0.71573 0.7015 0.66808
## Proportion of Variance 0.03057 0.03007 0.0273 0.02227 0.0214 0.01941
## Cumulative Proportion  0.83080 0.86087 0.8882 0.91044 0.9318 0.95125
##                           PC19    PC20    PC21    PC22    PC23
## Standard deviation     0.59160 0.53480 0.48619 0.40449 0.29217
## Proportion of Variance 0.01522 0.01244 0.01028 0.00711 0.00371
## Cumulative Proportion  0.96646 0.97890 0.98918 0.99629 1.00000

There are 23 principal components. The first two components explain approximately 28.4% of the variance in the data (0.1523 + 0.1316). Though the first two components do not cover the majority of the variance, they still capture a fairly reasonable amount. Below are plots with the 1st principal component on the X axis and the 2nd principal component on the Y axis.
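Before turning to those plots, a quick scree plot can confirm how the explained variance tapers off across components. Below is a minimal base-R sketch using the df.train.pca object created above.

# Scree plot (sketch): proportion of variance explained by each component.
pve <- df.train.pca$sdev^2 / sum(df.train.pca$sdev^2)
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", main = "Scree Plot")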

Plotting PCA

Let’s get a better understanding of these relationships by labeling the data points by beverage; with that in mind, we can see where each beverage type falls.

# Principal Component Analysis for the first two components.
# Reference: https://www.datacamp.com/community/tutorials/pca-analysis-r#interpret
# Uncomment the next two lines to install the ggbiplot package
#library(devtools)
#install_github("vqv/ggbiplot")
#library(ggbiplot)

ggbiplot(df.train.pca, alpha = 0.2, obs.scale = 1, var.scale = 1,
  groups = train.df.transformed$`Brand Code`, ellipse = TRUE, circle = FALSE) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top')

The arrow labels above show how each variable contributes to the principal components. As we can see, Carb Volume, Carb Pressure, Density, Carb Rel and Carb Temp all contribute fairly heavily to the first component, whereas Carb Pressure, Pressure Vacuum, Carb Flow and Oxygen Filler contribute to the second. (The authors acknowledge that the variables do not align perfectly with the 1st and 2nd PCs, and that the same variable can contribute to multiple PCs.)

Even from the first two principal components, we can already start to see some clustering of the data points along with their corresponding variables. Beverage D appears associated with higher Carb Volume, Density, Carb Pressure and Carb Temp, whereas beverages B and C sit at the opposite end of the spectrum. In fact, B and C appear quite similar to each other. Beverage A falls in between (B & C) and D.

As mentioned earlier, K-means clustering identified three clusters. The PCA supports the notion that there are three clusters of drinks, as B and C are manufactured quite similarly.

Model Evaluation

Classification and prediction are important steps. As each plant may have developed its own process for producing a beverage, we implemented three different supervised machine learning models to predict which beverage is being produced from the process measurements. The next section will then proceed as if the beverage type and manufacturing processes are known.

The training data will be further split into training and testing sets, and the evaluation set will be reserved for validation. (As the validation set does not contain PH values, we eliminate the PH column from the training, testing and validation sets.)

Splitting data set

# Splitting the data.
set.seed(123)
train_df <- subset(train.df.transformed, select = -PH)
validation_df <- eval.df.transformed[,!colnames(eval.df.transformed) %in% c('PH', 'Brand Code')]
inTraining <- createDataPartition(train_df$`Brand Code`, p = 0.80, list=FALSE)
training <- train_df[ inTraining,]
testing <- train_df[-inTraining,]
X_train <- subset(training, select = -`Brand Code`)
Y_train <- training$`Brand Code`
X_test <- subset(testing, select = -`Brand Code`)
Y_test <- testing$`Brand Code`

The following procedure will help identify the missing Brand Code values.

Because this is a classification problem, there are different algorithms that we can employ. Below are some considerations for modelling.

1. Multinomial Logistic Regression

In the following procedure, we use a Multinomial Logistic Regression model to predict the Brand Code for the observations where it is missing (NA).

# Multinomial Logistic Regression
# Reference: https://www.analyticsvidhya.com/blog/2016/02/multinomial-ordinal-logistic-regression/
# library("nnet")
multi_log_ml <- multinom(`Brand Code` ~ ., data = training)
## # weights:  96 (69 variable)
## initial  value 2852.993795 
## iter  10 value 1199.118233
## iter  20 value 1115.376869
## iter  30 value 1078.168228
## iter  40 value 1073.066268
## iter  50 value 1070.530157
## iter  60 value 1069.893674
## iter  70 value 1069.438744
## iter  80 value 1069.349809
## final  value 1069.349749 
## converged
summary(multi_log_ml)
## Call:
## multinom(formula = `Brand Code` ~ ., data = training)
## 
## Coefficients:
##   (Intercept) `Carb Volume` `Fill Ounces`  `PC Volume` `Carb Pressure`
## B   0.5060005    -0.5647241   -0.10950727 -0.004336197      0.02415049
## C  -0.9150091    -0.6964427    0.08477531  0.041817530      0.04942251
## D  -1.4430434     1.5061885   -0.61872770  0.099215526      0.39482352
##   `Carb Temp`        PSC  `PSC Fill`    `PSC CO2` `Carb Pressure1`
## B  0.16588705 0.09152114  0.11532617 -0.002634021      -0.23046934
## C  0.09980465 0.14479046  0.14096556  0.063002415      -0.43213189
## D -0.14426551 0.13085850 -0.04141122  0.118713598       0.06081384
##   `Fill Pressure` `Hyd Pressure4` Temperature `Usage cont`  `Carb Flow`
## B    -0.263078932       0.1747994  -0.9889559   0.25678596 -0.092434398
## C    -0.148574010       0.3252315  -0.1424925   0.30436009 -0.212046653
## D    -0.003899004      -1.3540677  -0.3744605  -0.03719576  0.000615723
##      Density        MFR `Pressure Vacuum` `Oxygen Filler` `Bowl Setpoint`
## B -4.3664194  0.1105132       0.360369027      0.15321372     -0.66776446
## C -3.2845065  0.2585239      -0.128495924      0.09280109      0.07592927
## D -0.4660494 -0.7013293       0.004375973      0.23630598      0.08917450
##   `Pressure Setpoint` `Air Pressurer` `Carb Rel`
## B           0.1934110      0.04963842  0.7314043
## C           0.3980602      0.01400873  0.2092435
## D          -0.6267293      0.15254569  0.4723883
## 
## Std. Errors:
##   (Intercept) `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## B   0.1687133     0.2320475     0.1238062   0.1373272       0.3024096
## C   0.2129957     0.2499761     0.1318608   0.1464332       0.3200804
## D   0.2363374     0.2073748     0.1264027   0.1388084       0.1867040
##   `Carb Temp`       PSC `PSC Fill` `PSC CO2` `Carb Pressure1`
## B   0.2751451 0.1257192  0.1251929 0.1179624        0.1393798
## C   0.2925728 0.1340164  0.1344745 0.1256101        0.1514360
## D   0.1670318 0.1293131  0.1163577 0.1164359        0.1390565
##   `Fill Pressure` `Hyd Pressure4` Temperature `Usage cont` `Carb Flow`
## B       0.1373903       0.1462058   0.1254562    0.1490657   0.1500267
## C       0.1463050       0.1596751   0.1176269    0.1596341   0.1602519
## D       0.1234059       0.1332298   0.1233084    0.1348158   0.1419408
##     Density       MFR `Pressure Vacuum` `Oxygen Filler` `Bowl Setpoint`
## B 0.2951186 0.1272432         0.1458799       0.1254977       0.1866479
## C 0.3145246 0.1407770         0.1585295       0.1300297       0.2005784
## D 0.2317276 0.1171085         0.1321612       0.1276121       0.1563752
##   `Pressure Setpoint` `Air Pressurer` `Carb Rel`
## B           0.1552665       0.1372601  0.2360161
## C           0.1704289       0.1509435  0.2510999
## D           0.1480242       0.1595693  0.2207955
## 
## Residual Deviance: 2138.699 
## AIC: 2276.699

The model execution output shows some iteration history and includes the Residual Deviance and AIC as seen above.

Each block has one row of values corresponding to one model equation. In the block of coefficients, the first row compares brand B to our baseline A, the second row compares C to A, and so forth.

For example, a one-unit increase in (scaled) Carb Volume changes the log odds of being brand B rather than baseline A by the corresponding coefficient shown above.
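Exponentiating a coefficient converts it from the log-odds scale to an odds ratio, which is often easier to communicate. A minimal illustrative sketch using the fitted model:

# Sketch: convert a log-odds coefficient into an odds ratio.
coefs <- summary(multi_log_ml)$coefficients
exp(coefs["B", "`Carb Volume`"])  # odds of brand B vs. baseline A

Because the Carb Volume coefficient for B is negative (-0.56), the resulting odds ratio is below 1: higher scaled Carb Volume lowers the odds of brand B relative to A.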

Let’s calculate the Z score and p-value for the variables in the model.

# Z-scores and p-values
z <- summary(multi_log_ml)$coefficients/summary(multi_log_ml)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2
t(round(p,4))
##                          B      C      D
## (Intercept)         0.0027 0.0000 0.0000
## `Carb Volume`       0.0149 0.0053 0.0000
## `Fill Ounces`       0.3764 0.5203 0.0000
## `PC Volume`         0.9748 0.7752 0.4748
## `Carb Pressure`     0.9363 0.8773 0.0345
## `Carb Temp`         0.5466 0.7330 0.3878
## PSC                 0.4666 0.2800 0.3116
## `PSC Fill`          0.3570 0.2945 0.7219
## `PSC CO2`           0.9822 0.6160 0.3079
## `Carb Pressure1`    0.0982 0.0043 0.6619
## `Fill Pressure`     0.0555 0.3099 0.9748
## `Hyd Pressure4`     0.2319 0.0417 0.0000
## Temperature         0.0000 0.2257 0.0024
## `Usage cont`        0.0850 0.0566 0.7826
## `Carb Flow`         0.5378 0.1858 0.9965
## Density             0.0000 0.0000 0.0443
## MFR                 0.3851 0.0663 0.0000
## `Pressure Vacuum`   0.0135 0.4176 0.9736
## `Oxygen Filler`     0.2221 0.4754 0.0641
## `Bowl Setpoint`     0.0003 0.7050 0.5685
## `Pressure Setpoint` 0.2129 0.0195 0.0000
## `Air Pressurer`     0.7176 0.9261 0.3391
## `Carb Rel`          0.0019 0.4047 0.0324

Listed above are the p-values.

Testing on the test set

Let’s test the model on the held-out test set.

# Testing against the held-out test set.
predicted <- predict(multi_log_ml, X_test)
table(predicted)
## predicted
##   A   B   C   D 
##  62 305  20 126

Confusion Matrix

The next step will be to calculate the Confusion Matrix and Misclassification Error as follows:

# Confusion Matrix
# Reference: http://r-statistics.co/Multinomial-Regression-With-R.html
table(predicted, Y_test)
##          Y_test
## predicted   A   B   C   D
##         A  51   5   2   4
##         B   8 236  55   6
##         C   4   9   6   1
##         D   8   2   0 116
cat("Classification Error: ", mean(as.character(predicted) != as.character(Y_test)))
## Classification Error:  0.202729

From the above results, we see a moderate misclassification error of roughly 20%.
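For richer diagnostics (accuracy, kappa, per-class sensitivity and specificity), caret’s confusionMatrix() can be applied to the same predictions; a minimal sketch:

# Sketch: caret's confusionMatrix() reports accuracy, kappa and per-class statistics.
confusionMatrix(factor(as.character(predicted)), factor(Y_test))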

2. Random Forest

Random Forest is a tree-based ensemble machine learning algorithm that has repeatedly proven successful. Three ntree values will be tested here: 100, 500 and 700 trees.

# Procedure to find best suited Random Forest model
ntrees <- c(100,500,700)
rf_class_error <- c()
i <- 1
for (j in ntrees){
  rf_model <- randomForest(x = X_train, y = as.factor(Y_train), ntree = j)
  predicted <- predict(rf_model, X_test)
  rf_class_error[i] <- mean(as.character(predicted) != as.character(Y_test))
  i <- i + 1
}
rf_df <- data.frame(ntrees, rf_class_error)
rf_df
##   ntrees rf_class_error
## 1    100      0.1247563
## 2    500      0.1189084
## 3    700      0.1208577

With 500 trees, the Random Forest’s misclassification error (roughly 12%) is significantly better than the multinomial logistic regression’s (roughly 20%).

Testing on the test set

Let’s test the fitted model on the held-out test set.

# Testing against the held-out test set.
rf_model <- randomForest(x = X_train, y = as.factor(Y_train), ntree = 500)
predicted <- predict(rf_model, X_test)
table(predicted)
## predicted
##   A   B   C   D 
##  64 292  32 125

Confusion Matrix

The next step will be to calculate the Confusion Matrix and Misclassification Error as follows:

# Random Forest model classification error.
table(predicted, Y_test)
##          Y_test
## predicted   A   B   C   D
##         A  58   3   0   3
##         B   9 245  36   2
##         C   1   2  27   2
##         D   3   2   0 120
cat("Classification Error: ", mean(as.character(predicted) != as.character(Y_test)))
## Classification Error:  0.122807

Variable Importance

Random Forest also provides a measure of the importance of the variables used in the classification.

# Variable importance
rf_varimp <- varImp(rf_model)
rf_varimp
##                     Overall
## Carb Volume       160.57835
## Fill Ounces        29.10788
## PC Volume          34.33603
## Carb Pressure      41.25973
## Carb Temp          26.74886
## PSC                24.31229
## PSC Fill           21.34152
## PSC CO2            17.71011
## Carb Pressure1     29.40786
## Fill Pressure      31.20974
## Hyd Pressure4     164.74855
## Temperature        74.06320
## Usage cont         28.32962
## Carb Flow          36.17104
## Density           285.22532
## MFR                39.97453
## Pressure Vacuum    24.78416
## Oxygen Filler      32.15890
## Bowl Setpoint      17.94898
## Pressure Setpoint  17.19494
## Air Pressurer      22.51586
## Carb Rel          204.78156
varImpPlot(rf_model)

From the above, Density stands out as the most important variable in this model, followed by Carb Rel, Hyd Pressure4 and Carb Volume.

3. Boosted Tree Ensemble via XGBoost

XGBoost, otherwise known as extreme gradient boosting, is a popular gradient-boosted tree ensemble algorithm used in many Kaggle competitions. Known for its computational efficiency and parallelization while obtaining highly accurate results, it is often one of the first algorithms machine learning practitioners and data scientists will try.

# Boosted Tree Ensemble via XGBoost
# To run this chunk, set the chunk option include=TRUE
# XGBoost requires the xgb.DMatrix format
X_train = xgb.DMatrix(as.matrix(X_train))
X_test = xgb.DMatrix(as.matrix(X_test))
# Creating a cross validation control
xgb_trcontrol = trainControl(
  method = "cv",
  number = 5,  
  allowParallel = TRUE,
  verboseIter = FALSE,
  returnData = FALSE
)
# Setting up a grid search for the best parameters
xgbGrid <- expand.grid(nrounds = c(100,200),  # equivalent to n_estimators in sklearn
                       max_depth = c(10, 15, 20, 25),
                       colsample_bytree = seq(0.5, 0.9, length.out = 5),
                       ## The values below are default values in the sklearn-api. 
                       eta = 0.1,
                       gamma=0,
                       min_child_weight = 1,
                       subsample = 1
                      )

set.seed(123) 

xgb_model = train(
  X_train, as.factor(Y_train),  
  trControl = xgb_trcontrol,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

To save time on re-runs, the fitted model is saved to disk and reloaded.

# The XGB_model is saved and will be loaded for time's sake
saveRDS(xgb_model, "./xgb_model_project2.rds")
xgb_model <- readRDS("./xgb_model_project2.rds")

Let’s review the results.

Testing on the test set

Let’s test the fitted model on the held-out test set.

# Testing against the held-out test set.
# XGboost results
predicted <- predict(xgb_model, X_test)
table(predicted)
## predicted
##   A   B   C   D 
##  66 284  37 126

Confusion Matrix

The next step will be to calculate the Confusion Matrix and Misclassification Error as follows:

# XG Boost model classification error.
table(predicted, Y_test)
##          Y_test
## predicted   A   B   C   D
##         A  57   4   1   4
##         B   8 241  32   3
##         C   1   5  30   1
##         D   5   2   0 119
cat("Classification Error: ", mean(as.character(predicted) != as.character(Y_test)))
## Classification Error:  0.128655

XGBoost performed comparably to the Random Forest here (12.9% vs. 12.3% misclassification), and both clearly outperformed the multinomial logistic regression (20.3%).
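For reference, the test-set misclassification errors reported above can be collected side by side:

# Test-set misclassification errors, copied from the results above.
data.frame(Model = c("Multinomial Logistic", "Random Forest (500 trees)", "XGBoost"),
           ClassError = c(0.2027, 0.1228, 0.1287))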

Variable Importance

XGBoost offers a way to identify the most important variables. (Note that the importance table below identifies features by column index rather than by name, apparently because the conversion to xgb.DMatrix did not preserve the feature names.)

# XGBoost procedure to identify the variable importance
xgb_varimp <- varImp(xgb_model)
xgb_varimp
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 22)
## 
##     Overall
## 14 100.0000
## 21  99.1069
## 0   39.5772
## 10  31.4315
## 11  18.2775
## 15  12.9339
## 13   8.0870
## 3    7.8391
## 2    7.6349
## 4    7.1605
## 17   7.0896
## 1    5.8702
## 16   5.4529
## 9    4.8134
## 8    4.7695
## 20   4.7663
## 12   3.8730
## 5    2.4446
## 18   2.3877
## 6    0.7867
# Selected model: XGBoost
final_predicted <- predict(xgb_model, validation_df)
table(eval.df.transformed$`Brand Code`, final_predicted)
##    final_predicted
##       A   B   C   D
##   A  33   2   1   2
##   B   3 124   2   1
##   C   2  12  18   0
##   D   3   1   2  61
cat("Classification Error: ", mean(as.character(final_predicted) != as.character(eval.df.transformed$`Brand Code`)))
## Classification Error:  0.1161049

Let’s take a look at the predicted Brand Codes.

# Collecting the observed values, the model predictions and the random imputations
csv_submission <- data.frame(observed = eval.df$`Brand Code`, predicted = final_predicted, random = eval.df.transformed$`Brand Code`)
#head(csv_submission)
csv_submission[is.na(csv_submission$observed),]
##     observed predicted random
## 16      <NA>         C      A
## 127     <NA>         A      C
## 167     <NA>         B      A
## 209     <NA>         C      D
## 210     <NA>         A      D
## 236     <NA>         A      A
## 256     <NA>         B      B
## 257     <NA>         C      D
# Uncomment the next line if you would like to create a csv file
# write.csv(csv_submission, "csv_submission.csv")

As shown, our model predicts plausible Brand Codes to fill the missing values in the given data set. Alongside, we include an extra column with the random inputs that were generated earlier as an alternative approach.
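If we wished to actually impute the missing Brand Code values with these predictions, a minimal sketch (assuming validation_df preserves eval.df’s row order) would be:

# Sketch: fill the missing Brand Code values with the model's predictions.
missing_idx <- which(is.na(eval.df$`Brand Code`))
eval.df$`Brand Code`[missing_idx] <- as.character(final_predicted)[missing_idx]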

Predicting pH

The above demonstrated these models’ ability to classify the different brands. However, the goal here is to predict PH from the information provided. As noted, we will use the three machine learning model families above to predict pH.

Prior to fitting the data, we must convert the one categorical variable by one-hot encoding its levels. This eliminates the challenges of having categorical variables by transforming the Brand Code data into indicator columns of 1s and 0s, depending on which brand category each sample belongs to.

#Splitting the data.
set.seed(123)
train_df <- train.df.transformed
validation_df <- eval.df.transformed
# Training
Brand_Code <- dummy_cols(train_df[,'Brand Code'])
Brand_Code <- subset(Brand_Code, select = -`Brand Code`)
train_df <- subset(train_df, select = -`Brand Code`)
train_df <- cbind(train_df, Brand_Code)
# Testing
Brand_Code <- dummy_cols(validation_df[,'Brand Code'])
Brand_Code <- subset(Brand_Code, select = -`Brand Code`)
validation_df<- subset(validation_df, select = -`Brand Code`)
validation_df <- cbind(validation_df, Brand_Code)

Similarly, the data set will be split into training, testing and validation sets.

# Procedure to split the data and define the train control.
inTraining <- createDataPartition(train_df$PH, p = 0.80, list=FALSE)
training <- train_df[ inTraining,]
testing <- train_df[-inTraining,]
X_train <- subset(training, select = -PH)
Y_train <- training$PH
X_test <- subset(testing, select = -PH)
Y_test <- testing$PH
# Defining the train control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)

1. Multiple Linear Regression

Ordinary Least Squares regression with backward stepwise selection is a simple linear model that can be fitted to the data set first. As this model is interpretable and easy to implement, with few or no parameters to tune, multiple linear regression is our natural first model.

# Full OLS model (note: lm() is deterministic, so the seed argument has no effect).
lm_mod <- lm(PH ~ .,
             data = training,
             seed = 29)
summary(lm_mod)
## 
## Call:
## lm(formula = PH ~ ., data = training, seed = 29)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4026 -0.5259  0.0906  0.5654  4.5770 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.214685   0.070061   3.064  0.00221 ** 
## `Carb Volume`       -0.046116   0.039759  -1.160  0.24623    
## `Fill Ounces`       -0.037184   0.019057  -1.951  0.05116 .  
## `PC Volume`         -0.045928   0.021013  -2.186  0.02895 *  
## `Carb Pressure`      0.009164   0.046451   0.197  0.84362    
## `Carb Temp`          0.014109   0.042918   0.329  0.74239    
## PSC                 -0.054565   0.019376  -2.816  0.00491 ** 
## `PSC Fill`          -0.015999   0.019023  -0.841  0.40042    
## `PSC CO2`           -0.040160   0.018612  -2.158  0.03106 *  
## `Carb Pressure1`     0.149954   0.022378   6.701 2.67e-11 ***
## `Fill Pressure`      0.059391   0.024976   2.378  0.01750 *  
## `Hyd Pressure4`     -0.012071   0.026211  -0.461  0.64518    
## Temperature         -0.092477   0.020826  -4.440 9.46e-06 ***
## `Usage cont`        -0.163993   0.022656  -7.238 6.41e-13 ***
## `Carb Flow`          0.108931   0.024035   4.532 6.18e-06 ***
## Density             -0.052890   0.046110  -1.147  0.25150    
## MFR                 -0.054228   0.021522  -2.520  0.01182 *  
## `Pressure Vacuum`    0.006486   0.021662   0.299  0.76463    
## `Oxygen Filler`      0.080283   0.020223   3.970 7.44e-05 ***
## `Bowl Setpoint`      0.303672   0.024681  12.304  < 2e-16 ***
## `Pressure Setpoint` -0.132143   0.026411  -5.003 6.12e-07 ***
## `Air Pressurer`     -0.008830   0.019032  -0.464  0.64275    
## `Carb Rel`           0.059186   0.036122   1.638  0.10148    
## `Brand Code_B`      -0.079771   0.100023  -0.798  0.42524    
## `Brand Code_A`      -0.474008   0.077367  -6.127 1.07e-09 ***
## `Brand Code_C`      -0.884854   0.109963  -8.047 1.43e-15 ***
## `Brand Code_D`             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8239 on 2032 degrees of freedom
## Multiple R-squared:  0.3279, Adjusted R-squared:  0.3197 
## F-statistic: 39.66 on 25 and 2032 DF,  p-value: < 2.2e-16

From the above, we have our full model. Let’s create a backward stepwise model.

# Stepwise Backwards
step.model <- stepAIC(lm_mod, direction = "backward", trace=FALSE)
summary(step.model)
## 
## Call:
## lm(formula = PH ~ `Carb Volume` + `Fill Ounces` + `PC Volume` + 
##     PSC + `PSC CO2` + `Carb Pressure1` + `Fill Pressure` + Temperature + 
##     `Usage cont` + `Carb Flow` + MFR + `Oxygen Filler` + `Bowl Setpoint` + 
##     `Pressure Setpoint` + `Carb Rel` + `Brand Code_A` + `Brand Code_C`, 
##     data = training, seed = 29)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3083 -0.5313  0.0910  0.5704  4.6287 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.16707    0.02146   7.786 1.09e-14 ***
## `Carb Volume`       -0.05100    0.02998  -1.701  0.08908 .  
## `Fill Ounces`       -0.03720    0.01877  -1.981  0.04770 *  
## `PC Volume`         -0.04582    0.02024  -2.264  0.02369 *  
## PSC                 -0.05845    0.01882  -3.107  0.00192 ** 
## `PSC CO2`           -0.04308    0.01826  -2.359  0.01842 *  
## `Carb Pressure1`     0.14880    0.02220   6.702 2.64e-11 ***
## `Fill Pressure`      0.06047    0.02482   2.436  0.01493 *  
## Temperature         -0.09142    0.02042  -4.477 7.98e-06 ***
## `Usage cont`        -0.16302    0.02223  -7.333 3.23e-13 ***
## `Carb Flow`          0.10765    0.02300   4.680 3.05e-06 ***
## MFR                 -0.05283    0.01984  -2.663  0.00782 ** 
## `Oxygen Filler`      0.07927    0.02002   3.960 7.76e-05 ***
## `Bowl Setpoint`      0.30953    0.02351  13.166  < 2e-16 ***
## `Pressure Setpoint` -0.13310    0.02595  -5.130 3.17e-07 ***
## `Carb Rel`           0.05708    0.03021   1.889  0.05903 .  
## `Brand Code_A`      -0.46892    0.05500  -8.525  < 2e-16 ***
## `Brand Code_C`      -0.81831    0.05844 -14.002  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.823 on 2040 degrees of freedom
## Multiple R-squared:  0.3266, Adjusted R-squared:  0.321 
## F-statistic: 58.21 on 17 and 2040 DF,  p-value: < 2.2e-16

The adjusted R-squared value is 0.321; in other words, the stepwise OLS model explains roughly 32% of the variance in the training data. However, in order to compare the models, we will rely on the test-set RMSE.
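These statistics can also be pulled programmatically from the model object rather than read off the printout; a small sketch:

# Sketch: extract fit statistics directly from the stepwise model object.
s <- summary(step.model)
round(c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared), 4)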

ANOVA

Let’s take a look at the Analysis of Variance results.

# ANOVA results for Stepwise Backwards model
step.model$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## PH ~ `Carb Volume` + `Fill Ounces` + `PC Volume` + `Carb Pressure` + 
##     `Carb Temp` + PSC + `PSC Fill` + `PSC CO2` + `Carb Pressure1` + 
##     `Fill Pressure` + `Hyd Pressure4` + Temperature + `Usage cont` + 
##     `Carb Flow` + Density + MFR + `Pressure Vacuum` + `Oxygen Filler` + 
##     `Bowl Setpoint` + `Pressure Setpoint` + `Air Pressurer` + 
##     `Carb Rel` + `Brand Code_B` + `Brand Code_A` + `Brand Code_C` + 
##     `Brand Code_D`
## 
## Final Model:
## PH ~ `Carb Volume` + `Fill Ounces` + `PC Volume` + PSC + `PSC CO2` + 
##     `Carb Pressure1` + `Fill Pressure` + Temperature + `Usage cont` + 
##     `Carb Flow` + MFR + `Oxygen Filler` + `Bowl Setpoint` + `Pressure Setpoint` + 
##     `Carb Rel` + `Brand Code_A` + `Brand Code_C`
## 
## 
##                   Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                         2032   1379.256 -771.5923
## 2     - `Brand Code_D`  0 0.00000000      2032   1379.256 -771.5923
## 3    - `Carb Pressure`  1 0.02641861      2033   1379.282 -773.5529
## 4  - `Pressure Vacuum`  1 0.06195459      2034   1379.344 -775.4604
## 5    - `Air Pressurer`  1 0.11662106      2035   1379.461 -777.2864
## 6    - `Hyd Pressure4`  1 0.14380658      2036   1379.605 -779.0719
## 7         - `PSC Fill`  1 0.47518419      2037   1380.080 -780.3632
## 8     - `Brand Code_B`  1 0.61970044      2038   1380.699 -781.4393
## 9            - Density  1 0.35976253      2039   1381.059 -782.9031
## 10       - `Carb Temp`  1 0.82959849      2040   1381.889 -783.6673

Predict pH

In this section, we will predict the pH values with our previously generated model.

# Procedure to calculate predicted values.
predicted <- predict(step.model, X_test)
residuals <- Y_test - predicted

RMSE

In this section, we will present the RMSE results.

# Procedure to calculate RMSE
RMSE <- sqrt(mean(residuals^2))

y_test_mean = mean(Y_test)
# Calculate total sum of squares
tss =  sum((Y_test - y_test_mean)^2 )
# Calculate residual sum of squares
rss =  sum(residuals^2)
# Calculate R-squared
rsq  =  1 - (rss/tss)

# Rounding
RMSE <- round(RMSE,4)
rsq <- round(rsq,4)

paste0("OLS Regression R-Squard Value: ", rsq)
## [1] "OLS Regression R-Squard Value: 0.3512"
paste0("OLS Regression RMSE: ", RMSE)
## [1] "OLS Regression RMSE: 0.8091"

2. Random Forest

Random Forest is a tree-based ensemble machine learning algorithm that has repeatedly proven successful. Three ntree values will be tested here: 100, 500 and 700 trees.

# Procedure to find Random Forest Model

# Function to Calculate RMSE
RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}

# Random Forest definitions
ntrees <- c(100,500,700)
rf_RMSE <- c()
i <- 1
for (j in ntrees){
  rf_model <- randomForest(x = X_train, y = Y_train, ntree = j)
  predicted <- predict(rf_model, X_test)
  rf_RMSE[i] <- RMSE(predicted, Y_test)  # compare predictions with the held-out test targets
  i <- i + 1
}
rf_df <- data.frame(ntrees, rf_RMSE)
rf_df
##   ntrees  rf_RMSE
## 1    100 1.234441
## 2    500 1.227010
## 3    700 1.229064

From the above table, the RMSE values are nearly identical across tree counts, with 500 trees nominally the lowest; since the differences are negligible, we proceed with 700 trees.

# Random Forest selected Model
rf_model <- randomForest(x = X_train, y = Y_train, ntree = 700)

Predict pH

In this section, we will predict the pH values with our previously generated model.

# Procedure to calculate predicted values.
predicted <- predict(rf_model, X_test)
residuals <- Y_test - predicted

RMSE

In this section, we will present the RMSE results.

# Procedure to calculate RMSE
RMSE <- sqrt(mean(residuals^2))

y_test_mean = mean(Y_test)
# Calculate total sum of squares
tss =  sum((Y_test - y_test_mean)^2 )
# Calculate residual sum of squares
rss =  sum(residuals^2)
# Calculate R-squared
rsq  =  1 - (rss/tss)

# Rounding
RMSE <- round(RMSE,4)
rsq <- round(rsq,4)

paste0("Random Forest R-Squared Value: ", rsq)
## [1] "Random Forest R-Squared Value: 0.6429"
paste0("Random Forest RMSE: ", RMSE)
## [1] "Random Forest RMSE: 0.6003"

The Random Forest R-squared (with 700 trees) is much better than the OLS model’s, and its RMSE (0.60 vs. 0.81) is substantially lower as well.

For now, we will consider the Random Forest the better model compared to OLS, given its higher R-squared.

3. Gradient Boosted Tree Models

As noted above, gradient boosted trees are a great algorithm to implement. The model will be implemented here and compared to the other models.

# To run this chunk, set the chunk option include=TRUE
# XGBoost requires the xgb.DMatrix format
X_train = xgb.DMatrix(as.matrix(X_train))
X_test = xgb.DMatrix(as.matrix(X_test))

# Creating a cross validation control
xgb_trcontrol = trainControl(
  method = "cv",
  number = 5,  
  allowParallel = TRUE,
  verboseIter = FALSE,
  returnData = FALSE
)

# Setting up a grid search for the best parameters
xgbGrid <- expand.grid(nrounds = c(100,200),  # equivalent to n_estimators in sklearn
                       max_depth = c(10, 15, 20, 25),
                       colsample_bytree = seq(0.5, 0.9, length.out = 5),
                       ## The values below are default values in the sklearn-api. 
                       eta = 0.1,
                       gamma=0,
                       min_child_weight = 1,
                       subsample = 1
                      )
set.seed(123) 
xgb_model = train(
  X_train, Y_train,  
  trControl = xgb_trcontrol,
  tuneGrid = xgbGrid,
  method = "xgbTree"
)

To save time on re-runs, the fitted model is saved to disk and reloaded.

# The XGB_model is saved and will be loaded for time's sake.
saveRDS(xgb_model, "./xgb_model_project2_regress.rds")
xgb_model <- readRDS("./xgb_model_project2_regress.rds")

Predict pH

In this section, we will predict the pH values with our previously generated model.

# Procedure to calculate predicted values.
predicted <- predict(xgb_model, X_test)
residuals <- Y_test - predicted

RMSE

In this section, we will present the RMSE results.

# Procedure to calculate RMSE.
RMSE <- sqrt(mean(residuals^2))

y_test_mean = mean(Y_test)
# Calculate total sum of squares
tss =  sum((Y_test - y_test_mean)^2 )
# Calculate residual sum of squares
rss =  sum(residuals^2)
# Calculate R-squared
rsq  =  1 - (rss/tss)

# Rounding
RMSE <- round(RMSE,4)
rsq <- round(rsq,4)

paste0("Extreme Gradient Boosting R-Squared Value: ", rsq)
## [1] "Extreme Gradient Boosting R-Squared Value: 0.6212"
paste0("Extreme Gradient Boosting RMSE: ", RMSE)
## [1] "Extreme Gradient Boosting RMSE: 0.6183"

The Extreme Gradient Boosting model performed far better than OLS, though it trailed the Random Forest slightly on this test set (R-squared 0.62 vs. 0.64; RMSE 0.618 vs. 0.600). Let us evaluate its most important variables.
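For reference, the test-set metrics reported above can be collected side by side:

# Test-set regression metrics, copied from the results above.
data.frame(Model = c("OLS (backward stepwise)", "Random Forest (700 trees)", "XGBoost"),
           RMSE = c(0.8091, 0.6003, 0.6183),
           R2 = c(0.3512, 0.6429, 0.6212))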

Variable Importance

Let’s examine the variable importance. (As in the classification section, the table below lists column indices rather than predictor names, since the feature names were not preserved through the xgb.DMatrix conversion.)

# Procedure to extract the Variable Importance.
xgb_varimp <- varImp(xgb_model)
xgb_varimp
## xgbTree variable importance
## 
##   only 20 most important variables shown (out of 26)
## 
##    Overall
## 12 100.000
## 18  73.916
## 16  54.262
## 11  46.857
## 24  46.384
## 17  46.076
## 21  36.744
## 8   32.988
## 13  32.503
## 2   27.871
## 0   26.229
## 9   22.939
## 15  21.414
## 19  16.058
## 1   13.568
## 14  13.446
## 20  12.322
## 25   6.574
## 22   5.877
## 5    4.694
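If the indices need to be mapped back to predictor names, a sketch is below; it assumes the importance table’s row names are 0-based positions into X_train’s columns, which is how xgboost labels unnamed features.

# Sketch: recover predictor names for xgboost's 0-based feature indices.
feature_names <- colnames(subset(training, select = -PH))
imp <- xgb_varimp$importance
imp$feature <- feature_names[as.integer(rownames(imp)) + 1]
head(imp[order(-imp$Overall), c("feature", "Overall")], 10)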

Final Model Selection

The XGBoost model was selected to generate the final PH predictions on the evaluation set.

# Final Model Selection.
final_predicted <- predict(xgb_model, validation_df[,-1])

Transforming our results back

# Procedure to revert previous pre-processing of centered and scaled.
PH <- final_predicted * ph.sigma + ph.mu

# Round to two decimals
PH <- round(PH,2)
summary(PH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.270   8.450   8.500   8.511   8.570   8.790
# PH Visualization
hist(PH,
    main = 'Predicted PH',
    xlab = 'PH',
    col = 3)

Creating Excel file

The following procedure exports our PH results to a CSV file, which Excel can open directly.

# Collecting the PH predictions
csv_submission <- data.frame(predicted = PH)
head(csv_submission)
##   predicted
## 1      8.54
## 2      8.46
## 3      8.43
## 4      8.59
## 5      8.49
## 6      8.39
# Attaching the predictions to the evaluation data set.
eval.df$PH <- PH

# Writing the full evaluation set, including predicted PH, to a csv file.
write.csv(eval.df, "csv_PH_submission.csv")

The output file is named csv_PH_submission.csv.
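Since the request was for an Excel-readable format, the CSV above suffices, but a true .xlsx workbook could also be written; a sketch assuming the writexl package is available:

# Optional sketch: write a true Excel workbook instead of a CSV.
# install.packages("writexl")   # if not already installed
writexl::write_xlsx(eval.df, "PH_submission.xlsx")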

Thank you!