Task

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' predictions for pH.

Data

First, we load the libraries that we will use for our project.
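
For reference, the library calls are listed below. This list is inferred from the packages attached in the original run and the functions called later in the report, so the exact calls in the .Rmd may differ slightly.

library(xlsx)          # read.xlsx / write.xlsx
library(caret)         # train, trainControl, varImp, postResample
library(randomForest)  # random forests
library(psych)         # describe
library(corrplot)      # correlation plots
library(dplyr)
library(tidyverse)     # tibble, tidyr, readr, purrr, stringr, forcats
library(rstatix)       # pipe-friendly statistical tests
library(DataExplorer)  # plot_histogram, plot_missing
library(DMwR)          # knnImputation
library(caTools)       # sample.split
library(neuralnet)     # neuralnet, compute
library(partykit)      # as.party, for plotting the rpart tree
library(xgboost)       # backend for caret's "xgbTree" method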


Then we load our dataset and explore some of the statistics for our variables.

bev <- read.xlsx("StudentData.xlsx", sheetName = "Subset", stringsAsFactors = FALSE)
bev_eval <- read.xlsx("StudentEvaluation.xlsx", sheetName = "Subset (2)", stringsAsFactors = FALSE)

We have 2,571 samples and 32 features that we can use to train our predictive model of PH. Here are summary statistics for our features:

describe(bev)
## Warning in describe(bev): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##                   vars    n    mean      sd  median trimmed    mad     min
## Brand.Code*          1 2451     NaN      NA      NA     NaN     NA     Inf
## Carb.Volume          2 2561    5.37    0.11    5.35    5.37   0.11    5.04
## Fill.Ounces          3 2533   23.97    0.09   23.97   23.98   0.08   23.63
## PC.Volume            4 2532    0.28    0.06    0.27    0.27   0.05    0.08
## Carb.Pressure        5 2544   68.19    3.54   68.20   68.12   3.56   57.00
## Carb.Temp            6 2545  141.09    4.04  140.80  140.99   3.85  128.60
## PSC                  7 2538    0.08    0.05    0.08    0.08   0.05    0.00
## PSC.Fill             8 2548    0.20    0.12    0.18    0.18   0.12    0.00
## PSC.CO2              9 2532    0.06    0.04    0.04    0.05   0.03    0.00
## Mnf.Flow            10 2569   24.57  119.48   65.20   21.07 169.02 -100.20
## Carb.Pressure1      11 2539  122.59    4.74  123.20  122.54   4.45  105.60
## Fill.Pressure       12 2549   47.92    3.18   46.40   47.71   2.37   34.60
## Hyd.Pressure1       13 2560   12.44   12.43   11.40   10.84  16.90   -0.80
## Hyd.Pressure2       14 2556   20.96   16.39   28.60   21.05  13.34    0.00
## Hyd.Pressure3       15 2556   20.46   15.98   27.60   20.51  13.94   -1.20
## Hyd.Pressure4       16 2541   96.29   13.12   96.00   95.45  11.86   52.00
## Filler.Level        17 2551  109.25   15.70  118.40  111.04   9.19   55.80
## Filler.Speed        18 2514 3687.20  770.82 3982.00 3919.99  47.44  998.00
## Temperature         19 2557   65.97    1.38   65.60   65.80   0.89   63.60
## Usage.cont          20 2566   20.99    2.98   21.79   21.25   3.19   12.08
## Carb.Flow           21 2569 2468.35 1073.70 3028.00 2601.14 326.17   26.00
## Density             22 2570    1.17    0.38    0.98    1.15   0.15    0.24
## MFR                 23 2359  704.05   73.90  724.00  718.16  15.42   31.40
## Balling             24 2570    2.20    0.93    1.65    2.13   0.37   -0.17
## Pressure.Vacuum     25 2571   -5.22    0.57   -5.40   -5.25   0.59   -6.60
## PH                  26 2567    8.55    0.17    8.54    8.55   0.18    7.88
## Oxygen.Filler       27 2559    0.05    0.05    0.03    0.04   0.02    0.00
## Bowl.Setpoint       28 2569  109.33   15.30  120.00  111.35   0.00   70.00
## Pressure.Setpoint   29 2559   47.62    2.04   46.00   47.60   0.00   44.00
## Air.Pressurer       30 2571  142.83    1.21  142.60  142.58   0.59  140.80
## Alch.Rel            31 2562    6.90    0.51    6.56    6.84   0.06    5.28
## Carb.Rel            32 2561    5.44    0.13    5.40    5.43   0.12    4.96
## Balling.Lvl         33 2570    2.05    0.87    1.48    1.98   0.21    0.00
##                       max   range  skew kurtosis    se
## Brand.Code*          -Inf    -Inf    NA       NA    NA
## Carb.Volume          5.70    0.66  0.39    -0.47  0.00
## Fill.Ounces         24.32    0.69 -0.02     0.86  0.00
## PC.Volume            0.48    0.40  0.34     0.67  0.00
## Carb.Pressure       79.40   22.40  0.18    -0.01  0.07
## Carb.Temp          154.00   25.40  0.25     0.24  0.08
## PSC                  0.27    0.27  0.85     0.65  0.00
## PSC.Fill             0.62    0.62  0.93     0.77  0.00
## PSC.CO2              0.24    0.24  1.73     3.73  0.00
## Mnf.Flow           229.40  329.60  0.00    -1.87  2.36
## Carb.Pressure1     140.20   34.60  0.05     0.14  0.09
## Fill.Pressure       60.40   25.80  0.55     1.41  0.06
## Hyd.Pressure1       58.00   58.80  0.78    -0.14  0.25
## Hyd.Pressure2       59.40   59.40 -0.30    -1.56  0.32
## Hyd.Pressure3       50.00   51.20 -0.32    -1.57  0.32
## Hyd.Pressure4      142.00   90.00  0.55     0.63  0.26
## Filler.Level       161.20  105.40 -0.85     0.05  0.31
## Filler.Speed      4030.00 3032.00 -2.87     6.71 15.37
## Temperature         76.20   12.60  2.39    10.16  0.03
## Usage.cont          25.90   13.82 -0.54    -1.02  0.06
## Carb.Flow         5104.00 5078.00 -0.99    -0.58 21.18
## Density              1.92    1.68  0.53    -1.20  0.01
## MFR                868.60  837.20 -5.09    30.46  1.52
## Balling              4.01    4.18  0.59    -1.39  0.02
## Pressure.Vacuum     -3.60    3.00  0.53    -0.03  0.01
## PH                   9.36    1.48 -0.29     0.06  0.00
## Oxygen.Filler        0.40    0.40  2.66    11.09  0.00
## Bowl.Setpoint      140.00   70.00 -0.97    -0.06  0.30
## Pressure.Setpoint   52.00    8.00  0.20    -1.60  0.04
## Air.Pressurer      148.20    7.40  2.25     4.73  0.02
## Alch.Rel             8.62    3.34  0.88    -0.85  0.01
## Carb.Rel             6.06    1.10  0.50    -0.29  0.00
## Balling.Lvl          3.66    3.66  0.59    -1.49  0.02

Here are the distributions for our variables:

plot_histogram(bev)

We can also visualize how much data we are missing for each of the individual predictors.

apply(bev,2,function(x) sum(is.na(x)))
##        Brand.Code       Carb.Volume       Fill.Ounces         PC.Volume 
##               120                10                38                39 
##     Carb.Pressure         Carb.Temp               PSC          PSC.Fill 
##                27                26                33                23 
##           PSC.CO2          Mnf.Flow    Carb.Pressure1     Fill.Pressure 
##                39                 2                32                22 
##     Hyd.Pressure1     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4 
##                11                15                15                30 
##      Filler.Level      Filler.Speed       Temperature        Usage.cont 
##                20                57                14                 5 
##         Carb.Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure.Vacuum                PH     Oxygen.Filler     Bowl.Setpoint 
##                 0                 4                12                 2 
## Pressure.Setpoint     Air.Pressurer          Alch.Rel          Carb.Rel 
##                12                 0                 9                10 
##       Balling.Lvl 
##                 1
plot_missing(bev)

We noticed that 4 samples had missing values for PH, so we decided to remove them from our analysis since we cannot use them to predict outcomes. This left us with 2,567 observations.

#removal of samples with missing PH values
bev <- bev[!is.na(bev$PH),]

Here we addressed the missing values of Brand.Code by assigning them the placeholder level "E".

bev$Brand.Code[is.na(bev$Brand.Code)] <- "E"
bev_eval$Brand.Code[is.na(bev_eval$Brand.Code)] <- "E"

Next, we use KNN imputation, which replaces a missing value with a weighted average of the values from the most similar (nearest-neighbor) observations. We perform this imputation for all predictors except Brand.Code, which is not numeric, and we leave the response variable PH untouched.

#imputations
bev_imp <- knnImputation(bev[, !names(bev) %in% c("Brand.Code", "PH")])
bev_imp$PH <- bev$PH
bev_imp$Brand.Code <- bev$Brand.Code

bev_eval_imp <- knnImputation(bev_eval[, !names(bev_eval) %in% c("Brand.Code", "PH")])
bev_eval_imp$PH <- bev_eval$PH
bev_eval_imp$Brand.Code <- bev_eval$Brand.Code

We also split our data into a training and test set, using 80% of our data to train our models and holding out 20% to test them.

#data splitting
set.seed(101) 
sample = sample.split(bev_imp$Brand.Code, SplitRatio = .8)
bev_train = subset(bev_imp, sample == TRUE)
bev_test  = subset(bev_imp, sample == FALSE)

bev_train_X = subset(bev_train, select = -PH)
bev_train_y = bev_train[,'PH']

bev_test_X = subset(bev_test, select = -PH)
bev_test_y = bev_test[,'PH']

Linear Regression Model

Linear regression is an attractive starting point because its representation is simple: a linear equation that combines a specific set of input values (x) to produce a predicted output (y). In this instance, we predict PH using three different linear regression techniques.

GLM

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
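
For a numeric response, caret's method = 'glm' fits a Gaussian GLM with the identity link (i.e., ordinary least squares) unless another family is supplied. A minimal sketch of the equivalent direct calls, outside of caret's resampling, with the Gamma/log line shown purely as an illustration of swapping the error distribution and link:

# equivalent of what caret fits below: Gaussian errors, identity link
glm_identity <- glm(PH ~ ., data = bev_train, family = gaussian(link = "identity"))

# illustrative alternative for a strictly positive response: Gamma errors, log link
glm_gamma <- glm(PH ~ ., data = bev_train, family = Gamma(link = "log"))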

control = trainControl(method = 'cv', number = 5, 
  verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)
set.seed(17)
GLM_bev_train = train(PH ~ ., data = bev_train, metric = 'RMSE', method = 'glm',preProcess = c('center', 'scale'), trControl = control)
GLM_reg_pred <- predict(GLM_bev_train, bev_test_X)

GLM_bev_train
## Generalized Linear Model 
## 
## 2053 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1337222  0.3895776  0.1032935

Using cross-validated RMSE as a benchmark for our linear models, the GLM performed reasonably well at 0.134. We will attempt two other linear techniques to determine whether this baseline can be improved.

glmnet

glmnet is an extremely efficient procedure for fitting the entire lasso or elastic-net regularization path for linear regression, as well as for logistic, multinomial, Poisson, and Cox regression models; more recent additions cover the multiple-response Gaussian and grouped multinomial cases. The algorithm uses cyclical coordinate descent in a path-wise fashion to determine the best fit at each value of the penalty.
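
Under the hood, caret's method = 'glmnet' calls glmnet::glmnet(), which computes the full regularization path for the elastic-net penalty (alpha = 0 is ridge, alpha = 1 is the lasso). A minimal sketch of a direct call, assuming the predictors are passed as a numeric model matrix:

library(glmnet)
X <- model.matrix(PH ~ ., data = bev_train)[, -1]  # dummy-code Brand.Code, drop intercept
fit_path <- glmnet(X, bev_train$PH, alpha = 0.1)   # mostly-ridge elastic net
plot(fit_path, xvar = "lambda", label = TRUE)      # coefficient paths as lambda grows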

set.seed(17)
glmnet_bev_train = train(PH ~ ., data = bev_train , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = control)

glmnet_reg_pred <- predict(glmnet_bev_train, bev_test_X)
glmnet_bev_train
## glmnet 
## 
## 2053 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0001485109  0.1336357  0.3900801  0.1033219
##   0.10   0.0014851094  0.1335344  0.3902833  0.1035784
##   0.10   0.0148510939  0.1353402  0.3767042  0.1062931
##   0.55   0.0001485109  0.1336330  0.3900154  0.1033408
##   0.55   0.0014851094  0.1339330  0.3865947  0.1040226
##   0.55   0.0148510939  0.1400201  0.3420083  0.1105175
##   1.00   0.0001485109  0.1336439  0.3898367  0.1033700
##   1.00   0.0014851094  0.1344404  0.3820734  0.1046130
##   1.00   0.0148510939  0.1440696  0.3123323  0.1136521
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.001485109.

The final values selected for the model were alpha = 0.1 and lambda = 0.0015, which produced a cross-validated RMSE of 0.1335, only a marginal improvement of roughly 0.0002 over the GLM. The near-identical results are explained by the very small lambda: a lambda of zero corresponds to an unpenalized, standard GLM, so with almost no penalty applied the elastic net fits essentially the same model. The elastic net accepts some bias in return for a reduction in the variance of the estimator, but here there is little penalization to produce that trade-off, hence the nearly identical RMSE values.

Partial Least Squares

Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear. It is a popular method for soft modelling in industrial applications, so we believed it would be an effective model to demonstrate.

set.seed(17)
pls_bev_train = train(PH ~ ., data = bev_train , metric = 'RMSE', method = 'pls',preProcess = c('center', 'scale'), trControl = control)

pls_reg_pred <- predict(pls_bev_train, bev_test_X)
pls_bev_train
## Partial Least Squares 
## 
## 2053 samples
##   32 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##   1      0.1497914  0.2328887  0.1186398
##   2      0.1417783  0.3127237  0.1114688
##   3      0.1388702  0.3399291  0.1096126
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

The final value used for the model was ncomp = 3, which produced a cross-validated RMSE of 0.1389. Somewhat to our surprise, this was the weakest of the three linear models.

PLS guards against over-fitting when there are many manifest (observed) factors by assuming that only a few underlying latent factors account for most of the variation in the response; the general idea is to extract those latent factors while explaining as much of the manifest variation as possible. In this dataset the predictors appear to act on PH directly rather than through strong latent structure, which likely explains why PLS was the least effective of the linear approaches.
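
One way to check how much structure the retained components actually capture is to inspect the variance explained by the final PLS fit; a sketch, assuming caret's finalModel is a standard pls-package mvr object (which it is for a numeric response):

pls::explvar(pls_bev_train$finalModel)   # % of predictor (X) variance per component
summary(pls_bev_train$finalModel)        # variance explained in both X and PH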

Non-linear Regression Models

In this section, we are going to fit a simple neural network using the neuralnet package and fit a linear model as a comparison.

bdev <- bev
data <- subset(bev_imp, select = -c(Brand.Code))

Confirming that there is no remaining missing data:

#describe(bev)
apply(data,2,function(x) sum(is.na(x)))
##       Carb.Volume       Fill.Ounces         PC.Volume     Carb.Pressure 
##                 0                 0                 0                 0 
##         Carb.Temp               PSC          PSC.Fill           PSC.CO2 
##                 0                 0                 0                 0 
##          Mnf.Flow    Carb.Pressure1     Fill.Pressure     Hyd.Pressure1 
##                 0                 0                 0                 0 
##     Hyd.Pressure2     Hyd.Pressure3     Hyd.Pressure4      Filler.Level 
##                 0                 0                 0                 0 
##      Filler.Speed       Temperature        Usage.cont         Carb.Flow 
##                 0                 0                 0                 0 
##           Density               MFR           Balling   Pressure.Vacuum 
##                 0                 0                 0                 0 
##     Oxygen.Filler     Bowl.Setpoint Pressure.Setpoint     Air.Pressurer 
##                 0                 0                 0                 0 
##          Alch.Rel          Carb.Rel       Balling.Lvl                PH 
##                 0                 0                 0                 0

There is no missing data, good. We proceed by randomly splitting the data into a train and a test set, then we fit a linear regression model and evaluate it on the test set. Note that we use the glm() function instead of lm(); this becomes useful when cross-validating the linear model (see the sketch after the model fit below).

index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
lm.fit <- glm(PH~., data=train)
summary(lm.fit)
## 
## Call:
## glm(formula = PH ~ ., data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.60499  -0.07357   0.01180   0.09077   0.59490  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.080e+01  1.194e+00   9.045  < 2e-16 ***
## Carb.Volume       -1.594e-02  8.231e-02  -0.194 0.846464    
## Fill.Ounces       -9.524e-02  3.860e-02  -2.467 0.013695 *  
## PC.Volume         -1.428e-01  6.481e-02  -2.203 0.027698 *  
## Carb.Pressure     -7.673e-05  3.584e-03  -0.021 0.982921    
## Carb.Temp          1.278e-03  2.853e-03   0.448 0.654316    
## PSC               -6.555e-02  7.003e-02  -0.936 0.349338    
## PSC.Fill          -6.658e-02  2.793e-02  -2.383 0.017251 *  
## PSC.CO2           -1.694e-01  7.637e-02  -2.218 0.026706 *  
## Mnf.Flow          -7.357e-04  5.493e-05 -13.394  < 2e-16 ***
## Carb.Pressure1     6.715e-03  8.356e-04   8.036 1.61e-15 ***
## Fill.Pressure      6.980e-04  1.475e-03   0.473 0.636039    
## Hyd.Pressure1     -3.898e-04  4.383e-04  -0.889 0.373920    
## Hyd.Pressure2     -7.078e-04  6.490e-04  -1.091 0.275522    
## Hyd.Pressure3      3.280e-03  7.113e-04   4.612 4.26e-06 ***
## Hyd.Pressure4     -4.157e-04  3.605e-04  -1.153 0.248972    
## Filler.Level      -1.428e-03  6.652e-04  -2.146 0.032007 *  
## Filler.Speed       1.003e-06  9.023e-06   0.111 0.911545    
## Temperature       -2.129e-02  2.602e-03  -8.183 5.01e-16 ***
## Usage.cont        -7.125e-03  1.379e-03  -5.166 2.64e-07 ***
## Carb.Flow          8.754e-06  4.296e-06   2.038 0.041733 *  
## Density           -1.355e-01  3.344e-02  -4.052 5.27e-05 ***
## MFR               -5.532e-05  7.182e-05  -0.770 0.441266    
## Balling           -7.895e-02  2.790e-02  -2.829 0.004712 ** 
## Pressure.Vacuum   -1.065e-02  9.098e-03  -1.171 0.241945    
## Oxygen.Filler     -3.073e-01  8.571e-02  -3.585 0.000345 ***
## Bowl.Setpoint      3.279e-03  6.992e-04   4.690 2.93e-06 ***
## Pressure.Setpoint -9.487e-03  2.347e-03  -4.041 5.53e-05 ***
## Air.Pressurer     -1.429e-03  2.899e-03  -0.493 0.621961    
## Alch.Rel           1.020e-01  1.938e-02   5.262 1.59e-07 ***
## Carb.Rel           1.156e-01  5.489e-02   2.106 0.035352 *  
## Balling.Lvl        7.087e-02  2.533e-02   2.798 0.005194 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01897517)
## 
##     Null deviance: 56.29  on 1924  degrees of freedom
## Residual deviance: 35.92  on 1893  degrees of freedom
## AIC: -2135.3
## 
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit,test)
# Note: test$medv does not exist in this dataset (medv is the response column from the
# Boston housing example this code was adapted from), so the difference below is taken
# over an empty vector and the reported MSE of 0 is not meaningful. The correct
# calculation would be sum((pr.lm - test$PH)^2)/nrow(test).
MSE.lm <- sum((pr.lm - test$medv)^2)/nrow(test)
MSE.lm
## [1] 0
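
Because the model was fit with glm() rather than lm(), it can be cross-validated directly with boot::cv.glm(); a sketch of a 10-fold estimate of the prediction MSE (not part of the original run):

library(boot)
set.seed(200)
cv.err <- cv.glm(train, lm.fit, K = 10)  # refits the glm on 10 folds
cv.err$delta[1]                          # raw cross-validated MSE estimate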

The sample(x, size) function simply outputs a vector of the specified size of randomly selected samples from the vector x. By default the sampling is without replacement: index is essentially a random vector of row indices.

Since we are dealing with a regression problem, we use the mean squared error (MSE) as a measure of how far our predictions are from the observed data.
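
For convenience, the two error measures used in this section can be wrapped in small helper functions (a base-R sketch):

mse  <- function(actual, predicted) mean((actual - predicted)^2)  # mean squared error
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))  # root mean squared error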

Preparing to fit the neural network

Before fitting a neural network, some preparation needs to be done.

As a first step, we address data preprocessing by normalizing the data before training the neural network. We use the min-max method, x' = (x - min) / (max - min), which rescales each variable to the interval [0,1]; scaling to [0,1] or [-1,1] usually tends to give better results.

We therefore scale and split the data before moving on:

#data
maxs <- apply(data, 2, max) 
mins <- apply(data, 2, min)

scale() returns a matrix, which we coerce back into a data.frame:

scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
#scaled
train_ <- scaled[index,]
test_ <- scaled[-index,]

Parameters

As far as we know there is no fixed rule for how many layers and neurons to use, although there are several more or less accepted rules of thumb. Usually one hidden layer is enough for a vast number of applications, and the number of neurons should fall somewhere between the input and output layer sizes, commonly around 2/3 of the input size. In our experience, repeated testing is the best approach, since there is no guarantee that any of these rules will fit a given model best. For this dataset we use two hidden layers: the input layer has one neuron per predictor, the two hidden layers have 5 and 3 neurons, and the output layer has a single neuron since we are doing regression.

Let's fit the net. Setting linear.output = TRUE makes neuralnet perform regression instead of classification.

n <- names(train_)
f <- as.formula(paste("PH ~", paste(n[!n %in% "PH"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)

Plot the Neural Network

plot(nn)

The black lines show the connections between each layer and the weights on each connection, while the blue lines show the bias term added in each step. The bias can be thought of as the intercept of a linear model. The net is essentially a black box, so we cannot say much about the fitting, the weights, or the model; suffice to say that the training algorithm has converged and the model is ready to be used.

Predicting PH using the neural network

Now we can try to predict the values for the test set and calculate the MSE. The net will output a normalized prediction, so we need to scale it back in order to make a meaningful comparison (or just a simple prediction).

pr.nn <- compute(nn,test_[,1:32])
pr.nn_ <- pr.nn$net.result*(max(data$PH)-min(data$PH))+min(data$PH)
test.r <- (test_$PH)*(max(data$PH)-min(data$PH))+min(data$PH)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
print(paste(MSE.lm,MSE.nn))
## [1] "0 0.0177398095734633"

At first glance the linear model appears to outperform the neural network, since the printed MSE for the lm is 0. As noted in the code comment above, however, that 0 is an artifact of referencing the nonexistent test$medv column and is not a real result; the meaningful number here is the neural network's test MSE of roughly 0.0177.

lm.rmse <- sqrt(MSE.lm)
print(lm.rmse)
## [1] 0

The RMSE for our Neural Network is as follows:

nn.rmse <- sqrt(MSE.nn)
print(nn.rmse)
## [1] 0.1331909

The RMSE of 0 printed for the linear model is likewise an artifact of the column error. A fairer linear benchmark is the cross-validated GLM RMSE of about 0.134 reported earlier, which is essentially the same as the neural network's test RMSE of 0.133.

Output Plot

plot(test$PH,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
points(test$PH,pr.lm,col='blue',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=18,col=c('red','blue'))

Tree Models

In this next part, we consider various tree models to predict the PH of a beverage given our information about the 32 manufacturing features.

Basic Regression Tree

Classification and regression trees can be generated through rpart to create simple tree models. Tree-based models consist of one or more nested if-then statements on the predictors that partition the data; within each partition, a simple model (for a regression tree, a constant) is used to predict the outcome.

bevb <- train(x = bev_train_X, y = bev_train_y, method = "rpart", preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
bevb
## CART 
## 
## 2053 samples
##   32 predictor
## 
## Pre-processing: centered (31), scaled (31), ignore (1) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.04517033  0.1440003  0.2811415  0.1131109
##   0.06148521  0.1483355  0.2369000  0.1158088
##   0.21000757  0.1613637  0.1948827  0.1278185
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04517033.
bevbPred <- predict(bevb, newdata = bev_test_X)
bevb.results <- postResample(pred = bevbPred, obs = bev_test_y)
bevb.results
##      RMSE  Rsquared       MAE 
## 0.1465564 0.3279953 0.1168731

The RMSE for this model on the test set is 0.147 with an R^2 of 0.33.

We also plot this specific tree below:

plot(as.party(bevb$finalModel))

Next, we will try out a Random Forest model to see if we can improve upon this model.

Random Forest

Random Forest is an ensemble model: each tree produces its own prediction, and the forest combines them, by majority vote for classification and by averaging for regression as we do here. Because the many trees are grown on bootstrap samples of the data, combining their outputs reduces overfitting and variance. A small illustration of the averaging follows the tuned model below.

bev_train_X2 <- bev_train_X
bev_train_X2$Brand.Code <- as.factor(bev_train_X2$Brand.Code)

bevrf <- train(x = bev_train_X2, y = bev_train_y, method = "rf", preProcess = c('center', 'scale'))
bevrf
## Random Forest 
## 
## 2053 samples
##   32 predictor
## 
## Pre-processing: centered (31), scaled (31), ignore (1) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1143393  0.5966905  0.08632714
##   17    0.1021765  0.6533844  0.07452249
##   32    0.1023225  0.6450190  0.07425013
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
bev_test_X2 <- bev_test_X
bev_test_X2$Brand.Code <- as.factor(bev_test_X2$Brand.Code)

bevrfPred <- predict(bevrf, newdata = bev_test_X2)
bevrf.results <- postResample(pred = bevrfPred, obs = bev_test_y)
bevrf.results
##       RMSE   Rsquared        MAE 
## 0.09605637 0.73994283 0.07029959
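
To make the tree-combination idea concrete, here is a small standalone illustration (a hypothetical rf_demo fit, separate from the tuned caret model above): for regression, the forest's prediction is simply the mean of the individual trees' predictions.

library(randomForest)
set.seed(101)
rf_demo  <- randomForest(x = bev_train_X2, y = bev_train_y, ntree = 100)
pred_all <- predict(rf_demo, bev_test_X2, predict.all = TRUE)
# the aggregate forest prediction equals the row-wise mean across the 100 trees
all.equal(unname(pred_all$aggregate), unname(rowMeans(pred_all$individual)))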

The RMSE for this model on the test set is 0.096 with an R^2 of 0.74. We will also consider an XGBoost model to see if we can find any improvements.

XGBoost

XGBoost is another ensemble model, this time built on the gradient boosting framework, a form of boosting in which each new tree is fit to reduce the remaining errors of the ensemble via a gradient descent algorithm. XGBoost only handles numeric inputs, so we have to recode our Brand.Code feature into numeric before tuning our model.

bev_train_X3 <- bev_train_X
bev_train_X3$Brand.Code <- as.numeric(bev_train_X2$Brand.Code)

bevxgb <- train(x = bev_train_X3, y = bev_train_y, method = "xgbTree")
bevxgb
## eXtreme Gradient Boosting 
## 
## 2053 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  RMSE       Rsquared 
##   0.3  1          0.6               0.50        50      0.1348444  0.3823046
##   0.3  1          0.6               0.50       100      0.1320053  0.4038251
##   0.3  1          0.6               0.50       150      0.1311875  0.4109874
##   0.3  1          0.6               0.75        50      0.1345659  0.3863117
##   0.3  1          0.6               0.75       100      0.1314800  0.4095951
##   0.3  1          0.6               0.75       150      0.1304921  0.4172032
##   0.3  1          0.6               1.00        50      0.1346729  0.3879390
##   0.3  1          0.6               1.00       100      0.1314202  0.4127691
##   0.3  1          0.6               1.00       150      0.1301949  0.4214902
##   0.3  1          0.8               0.50        50      0.1345012  0.3853522
##   0.3  1          0.8               0.50       100      0.1317437  0.4060782
##   0.3  1          0.8               0.50       150      0.1311550  0.4114072
##   0.3  1          0.8               0.75        50      0.1343933  0.3882591
##   0.3  1          0.8               0.75       100      0.1314519  0.4100160
##   0.3  1          0.8               0.75       150      0.1302779  0.4192618
##   0.3  1          0.8               1.00        50      0.1341937  0.3931406
##   0.3  1          0.8               1.00       100      0.1312519  0.4144106
##   0.3  1          0.8               1.00       150      0.1299507  0.4236685
##   0.3  2          0.6               0.50        50      0.1271272  0.4472720
##   0.3  2          0.6               0.50       100      0.1249203  0.4668120
##   0.3  2          0.6               0.50       150      0.1242268  0.4749295
##   0.3  2          0.6               0.75        50      0.1255304  0.4618236
##   0.3  2          0.6               0.75       100      0.1228114  0.4840227
##   0.3  2          0.6               0.75       150      0.1217355  0.4938568
##   0.3  2          0.6               1.00        50      0.1258326  0.4600049
##   0.3  2          0.6               1.00       100      0.1223547  0.4876801
##   0.3  2          0.6               1.00       150      0.1212365  0.4972870
##   0.3  2          0.8               0.50        50      0.1262705  0.4544991
##   0.3  2          0.8               0.50       100      0.1247231  0.4688732
##   0.3  2          0.8               0.50       150      0.1240470  0.4771093
##   0.3  2          0.8               0.75        50      0.1257802  0.4591446
##   0.3  2          0.8               0.75       100      0.1228193  0.4842349
##   0.3  2          0.8               0.75       150      0.1217467  0.4942752
##   0.3  2          0.8               1.00        50      0.1246914  0.4703939
##   0.3  2          0.8               1.00       100      0.1215464  0.4946886
##   0.3  2          0.8               1.00       150      0.1204430  0.5038795
##   0.3  3          0.6               0.50        50      0.1243324  0.4714911
##   0.3  3          0.6               0.50       100      0.1233583  0.4841541
##   0.3  3          0.6               0.50       150      0.1230778  0.4892816
##   0.3  3          0.6               0.75        50      0.1215663  0.4939680
##   0.3  3          0.6               0.75       100      0.1194024  0.5135975
##   0.3  3          0.6               0.75       150      0.1188775  0.5195427
##   0.3  3          0.6               1.00        50      0.1210272  0.4987666
##   0.3  3          0.6               1.00       100      0.1185590  0.5193150
##   0.3  3          0.6               1.00       150      0.1179502  0.5252458
##   0.3  3          0.8               0.50        50      0.1242515  0.4725333
##   0.3  3          0.8               0.50       100      0.1227746  0.4895285
##   0.3  3          0.8               0.50       150      0.1228564  0.4927192
##   0.3  3          0.8               0.75        50      0.1221868  0.4892047
##   0.3  3          0.8               0.75       100      0.1202711  0.5072904
##   0.3  3          0.8               0.75       150      0.1198294  0.5127807
##   0.3  3          0.8               1.00        50      0.1210405  0.4985188
##   0.3  3          0.8               1.00       100      0.1188145  0.5174223
##   0.3  3          0.8               1.00       150      0.1182544  0.5231257
##   0.4  1          0.6               0.50        50      0.1341094  0.3857087
##   0.4  1          0.6               0.50       100      0.1322465  0.4018726
##   0.4  1          0.6               0.50       150      0.1318573  0.4068002
##   0.4  1          0.6               0.75        50      0.1333919  0.3937111
##   0.4  1          0.6               0.75       100      0.1311540  0.4110954
##   0.4  1          0.6               0.75       150      0.1305185  0.4171463
##   0.4  1          0.6               1.00        50      0.1330359  0.3998425
##   0.4  1          0.6               1.00       100      0.1304134  0.4192708
##   0.4  1          0.6               1.00       150      0.1294001  0.4271819
##   0.4  1          0.8               0.50        50      0.1341155  0.3855635
##   0.4  1          0.8               0.50       100      0.1326574  0.3979709
##   0.4  1          0.8               0.50       150      0.1327138  0.3999915
##   0.4  1          0.8               0.75        50      0.1330398  0.3975192
##   0.4  1          0.8               0.75       100      0.1308921  0.4134917
##   0.4  1          0.8               0.75       150      0.1302316  0.4197039
##   0.4  1          0.8               1.00        50      0.1331379  0.3985828
##   0.4  1          0.8               1.00       100      0.1305303  0.4183593
##   0.4  1          0.8               1.00       150      0.1295697  0.4255322
##   0.4  2          0.6               0.50        50      0.1283154  0.4375900
##   0.4  2          0.6               0.50       100      0.1270359  0.4544894
##   0.4  2          0.6               0.50       150      0.1268460  0.4606956
##   0.4  2          0.6               0.75        50      0.1261218  0.4555841
##   0.4  2          0.6               0.75       100      0.1241155  0.4752960
##   0.4  2          0.6               0.75       150      0.1236380  0.4817992
##   0.4  2          0.6               1.00        50      0.1247331  0.4677182
##   0.4  2          0.6               1.00       100      0.1221623  0.4900726
##   0.4  2          0.6               1.00       150      0.1216626  0.4959950
##   0.4  2          0.8               0.50        50      0.1277910  0.4421537
##   0.4  2          0.8               0.50       100      0.1267644  0.4561743
##   0.4  2          0.8               0.50       150      0.1269287  0.4602925
##   0.4  2          0.8               0.75        50      0.1256522  0.4594194
##   0.4  2          0.8               0.75       100      0.1236538  0.4791164
##   0.4  2          0.8               0.75       150      0.1232684  0.4856178
##   0.4  2          0.8               1.00        50      0.1243008  0.4714418
##   0.4  2          0.8               1.00       100      0.1223927  0.4881998
##   0.4  2          0.8               1.00       150      0.1218716  0.4943462
##   0.4  3          0.6               0.50        50      0.1277691  0.4488687
##   0.4  3          0.6               0.50       100      0.1279276  0.4575580
##   0.4  3          0.6               0.50       150      0.1281195  0.4612835
##   0.4  3          0.6               0.75        50      0.1232397  0.4824109
##   0.4  3          0.6               0.75       100      0.1226484  0.4922922
##   0.4  3          0.6               0.75       150      0.1228919  0.4933984
##   0.4  3          0.6               1.00        50      0.1212984  0.4966803
##   0.4  3          0.6               1.00       100      0.1201459  0.5091282
##   0.4  3          0.6               1.00       150      0.1202818  0.5102040
##   0.4  3          0.8               0.50        50      0.1259074  0.4645068
##   0.4  3          0.8               0.50       100      0.1264617  0.4694143
##   0.4  3          0.8               0.50       150      0.1273376  0.4677091
##   0.4  3          0.8               0.75        50      0.1234238  0.4812640
##   0.4  3          0.8               0.75       100      0.1231713  0.4891812
##   0.4  3          0.8               0.75       150      0.1234571  0.4901053
##   0.4  3          0.8               1.00        50      0.1211727  0.4980852
##   0.4  3          0.8               1.00       100      0.1202789  0.5086490
##   0.4  3          0.8               1.00       150      0.1203230  0.5107155
##   MAE       
##   0.10483504
##   0.10183430
##   0.10083551
##   0.10465561
##   0.10169264
##   0.10044165
##   0.10493589
##   0.10194501
##   0.10066435
##   0.10447757
##   0.10153575
##   0.10069593
##   0.10452288
##   0.10158124
##   0.10021809
##   0.10462855
##   0.10177400
##   0.10045887
##   0.09741027
##   0.09522894
##   0.09446470
##   0.09630067
##   0.09338401
##   0.09230971
##   0.09644014
##   0.09305954
##   0.09192091
##   0.09646372
##   0.09464185
##   0.09395021
##   0.09622544
##   0.09319100
##   0.09216416
##   0.09570384
##   0.09243421
##   0.09126579
##   0.09433460
##   0.09332015
##   0.09301100
##   0.09221326
##   0.09013184
##   0.08951059
##   0.09172819
##   0.08920710
##   0.08857080
##   0.09420147
##   0.09279142
##   0.09275244
##   0.09275341
##   0.09084203
##   0.09029744
##   0.09152756
##   0.08931455
##   0.08861755
##   0.10392997
##   0.10178471
##   0.10117610
##   0.10342010
##   0.10110100
##   0.10042530
##   0.10345532
##   0.10081684
##   0.09977308
##   0.10385682
##   0.10199029
##   0.10176775
##   0.10314741
##   0.10071457
##   0.09978745
##   0.10344656
##   0.10094106
##   0.09980682
##   0.09786071
##   0.09647071
##   0.09617791
##   0.09616451
##   0.09413578
##   0.09353021
##   0.09514360
##   0.09275854
##   0.09220335
##   0.09735515
##   0.09631818
##   0.09645978
##   0.09573556
##   0.09381051
##   0.09321017
##   0.09503359
##   0.09292912
##   0.09225425
##   0.09659734
##   0.09655981
##   0.09668012
##   0.09366038
##   0.09283432
##   0.09286321
##   0.09167854
##   0.09074592
##   0.09062142
##   0.09547108
##   0.09554308
##   0.09626054
##   0.09329068
##   0.09279226
##   0.09295755
##   0.09169396
##   0.09060586
##   0.09047154
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
##  = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 1.
bev_test_X3 <- bev_test_X
bev_test_X3$Brand.Code <- as.numeric(as.factor(bev_test_X3$Brand.Code))

bevxgbPred <- predict(bevxgb, newdata = bev_test_X3)
bevxgb.results <- postResample(pred = bevxgbPred, obs = bev_test_y)
bevxgb.results
##       RMSE   Rsquared        MAE 
## 0.10898447 0.62870015 0.08165521

The RMSE for this model on the test set is 0.109 with an R^2 of 0.63.

Lastly for this section, the results of these 3 Tree models are summarized here:

xgb <- as.data.frame(as.list(bevxgb.results))
basic <- as.data.frame(as.list(bevb.results))
rf <- as.data.frame(as.list(bevrf.results))
xgb$model <- 'XGBoost'
basic$model <- 'Basic Regression Tree'
rf$model <- 'Random Forest'

tree.outcomes <- rbind(xgb, basic, rf)
tree.outcomes
##         RMSE  Rsquared        MAE                 model
## 1 0.10898447 0.6287002 0.08165521               XGBoost
## 2 0.14655642 0.3279953 0.11687313 Basic Regression Tree
## 3 0.09605637 0.7399428 0.07029959         Random Forest

We considered a basic regression tree, Random Forest, and XGBoost, and Random Forest performed the best in predicting PH, with the smallest test RMSE (0.096) and the highest R^2 (0.74).

In the Random Forest model, these were the most important predictors:

varImp(bevrf)
## rf variable importance
## 
##   only 20 most important variables shown (out of 32)
## 
##                 Overall
## Mnf.Flow        100.000
## Brand.Code       50.672
## Usage.cont       44.462
## Oxygen.Filler    28.576
## Bowl.Setpoint    24.134
## Alch.Rel         22.050
## Temperature      21.678
## Pressure.Vacuum  21.432
## Carb.Rel         20.932
## Balling.Lvl      19.665
## Air.Pressurer    16.480
## Carb.Pressure1   16.451
## Filler.Level     13.592
## Filler.Speed     13.325
## Balling          12.348
## Carb.Flow        11.095
## Density           9.610
## Hyd.Pressure3     8.417
## PC.Volume         7.724
## MFR               6.259
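
The same ranking can be visualized with caret's plot method for variable importance (a one-line sketch):

plot(varImp(bevrf), top = 10)  # dotplot of the 10 most important predictors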

Conclusion

Finally, we will choose our best model for predicting PH among the chosen Linear, Non-Linear and Tree Models that we have gone over in this analysis. Their RMSE metrics are summarized here:

lin_model_perf <- getTrainPerf(glmnet_bev_train)
print(lin_model_perf)
##   TrainRMSE TrainRsquared  TrainMAE method
## 1 0.1335344     0.3902833 0.1035784 glmnet
print(nn.rmse)
## [1] 0.1331909
rf <- as.data.frame(as.list(bevrf.results))
print(rf)
##         RMSE  Rsquared        MAE
## 1 0.09605637 0.7399428 0.07029959

The cross-validated RMSE for the chosen linear model (glmnet) was 0.134. The test-set RMSE for the chosen non-linear model (the neural network) was 0.133. And lastly, the test-set RMSE for the chosen tree model (Random Forest) was 0.096. So, we chose the Random Forest model to predict the PH of our beverages given the predictors on hand.

In this chosen Random Forest model, the top 5 predictors that we found to influence PH in the manufacturing process are Mnf.Flow, Brand.Code, Usage.cont, Oxygen.Filler, and Bowl.Setpoint. We should pay close attention to these predictors when trying to control the PH of our beverages given the new regulations that have been enacted.

Finally, we will predict the PH values for the evaluation dataset that has been provided using the Random Forest model that was chosen as the best model. We are including these predictions with this submission as well.

bev_eval_X2 <- bev_eval_imp
bev_eval_X2$Brand.Code <- as.factor(bev_eval_X2$Brand.Code)

finalPH <- predict(bevrf, newdata = bev_eval_X2)

finalResult <- bev_eval
finalResult$PH <- finalPH
  
head(finalResult)
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          D    5.480000    24.03333 0.2700000          65.4     134.6 0.236
## 2          A    5.393333    23.95333 0.2266667          63.2     135.0 0.042
## 3          B    5.293333    23.92000 0.3033333          66.4     140.4 0.068
## 4          B    5.266667    23.94000 0.1860000          64.8     139.0 0.004
## 5          B    5.406667    24.20000 0.1600000          69.4     142.2 0.040
## 6          B    5.286667    24.10667 0.2120000          73.4     147.2 0.078
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.40    0.04     -100          116.6          46.0             0
## 2     0.22    0.08     -100          118.8          46.2             0
## 3     0.10    0.02     -100          120.2          45.8             0
## 4     0.20    0.02     -100          124.8          40.0             0
## 5     0.30    0.06     -100          115.0          51.4             0
## 6     0.22      NA     -100          118.6          46.4             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA            96        129.4         3986
## 2             0             0           112        120.0         4012
## 3             0             0            98        119.4         4010
## 4             0             0           132        120.2           NA
## 5             0             0            94        116.0         4018
## 6             0             0            94        120.4         4010
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum
## 1        66.0      21.66      2950    0.88 727.6   1.398            -3.8
## 2        65.6      17.60      2916    1.50 735.8   2.942            -4.4
## 3        65.6      24.18      3056    0.90 734.8   1.448            -4.2
## 4        74.4      18.12        28    0.74    NA   1.056            -4.0
## 5        66.4      21.32      3214    0.88 752.0   1.398            -4.0
## 6        66.6      18.00      3064    0.84 732.0   1.298            -3.8
##         PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## 1 8.561109         0.022           130              45.2         142.6     6.56
## 2 8.482686         0.030           120              46.0         147.2     7.14
## 3 8.522793         0.046           120              46.0         146.6     6.52
## 4 8.567560            NA           120              46.0         146.4     6.48
## 5 8.486979         0.082           120              50.0         145.8     6.50
## 6 8.531598         0.064           120              46.0         146.0     6.50
##   Carb.Rel Balling.Lvl
## 1     5.34        1.48
## 2     5.58        3.04
## 3     5.34        1.46
## 4     5.50        1.48
## 5     5.38        1.46
## 6     5.42        1.44
#write.csv(finalResult, "predictions.csv")
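
Since the deliverable is an Excel-readable file, the scored evaluation set can also be written out with write.xlsx() from the xlsx package already used above; a sketch (the file name is just a placeholder):

# write the evaluation data with the predicted PH column to an .xlsx workbook
write.xlsx(finalResult, "PH_predictions.xlsx", sheetName = "Predictions", row.names = FALSE)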