New regulations require us to better understand the manufacturing process of our beverages. In this report, we describe how we built a predictive model to determine the PH level of a can. We outline how we prepared the data and how we selected our predictive model, and we conclude by using the selected model to make predictions on a new set of data.

Data Preparation

To prepare the data for building models, we plotted the missing data. There is very little missing data, and it is widely spread across the samples. This indicates that we should be able to successfully impute the missing data without worrying about injecting too much new information into the data set.

student.data <- readxl::read_excel('./data/StudentData.xlsx')
student.evaluation <- readxl::read_excel('./data/StudentEvaluation.xlsx')

# Plot the pattern of missing values across all columns using the VIM package
missing.data.plot <- function(data){
  data %>%
    VIM::aggr(col=c('navyblue', 'yellow'),
      numbers=TRUE, sortVars=TRUE,
      labels=names(data), cex.axis=.6,
      gap=3, ylab=c('Missing Data', 'Pattern'), combined=TRUE
    )
}

missing.data.plot(student.data)

## 
##  Variables sorted by number of missings: 
##           Variable Count
##                MFR   212
##         Brand Code   120
##       Filler Speed    57
##          PC Volume    39
##            PSC CO2    39
##        Fill Ounces    38
##                PSC    33
##     Carb Pressure1    32
##      Hyd Pressure4    30
##      Carb Pressure    27
##          Carb Temp    26
##           PSC Fill    23
##      Fill Pressure    22
##       Filler Level    20
##      Hyd Pressure2    15
##      Hyd Pressure3    15
##        Temperature    14
##      Oxygen Filler    12
##  Pressure Setpoint    12
##      Hyd Pressure1    11
##        Carb Volume    10
##           Carb Rel    10
##           Alch Rel     9
##         Usage cont     5
##                 PH     4
##           Mnf Flow     2
##          Carb Flow     2
##      Bowl Setpoint     2
##            Density     1
##            Balling     1
##        Balling Lvl     1
##    Pressure Vacuum     0
##      Air Pressurer     0

We discovered that four samples are missing the response variable PH. There are a number of ways to address this issue, but ultimately we decided to simply remove these samples. The goal of the model is to predict the PH level, and we were concerned about biasing the data by imputing values for the response itself. In addition, there is one major outlier whose PH level sits far above the rest of the distribution. Although we ultimately selected a modeling technique that is robust to outliers, we believe this value was either incorrectly recorded or the result of a massive aberration that is unlikely to be seen again. Thus, we felt it best to remove this sample as well.

student.data %>%
  ggplot(aes(PH, fill=PH > 9)) + 
  geom_histogram(bins=30) +
  theme_bw() +
  theme(legend.position='none') +
  labs(y='Count',
       title='PH Levels in Training Data')
## Warning: Removed 4 rows containing non-finite values (stat_bin).

student.data <- student.data %>%
  filter(!is.na(PH),
         PH < 9)

We created a simple data processing pipeline to fix a number of formatting issues with the data. We also one-hot encoded the only categorical variable, 'Brand Code'. Most importantly, this pipeline imputes the missing data using a powerful method called MICE (Multivariate Imputation by Chained Equations). In short, MICE fits a separate imputation model for each variable that has missing values, treating that variable as the response and the remaining variables as predictors. The process is repeated over several iterations until the imputed values stabilize.

# Strip whitespace from column names and lower-case them
clean.cols <- function(name){
  name <- gsub('\\s', '', name)
  name <- tolower(name)
  return(name)
}

pipeline <- function(data){
  ph <- data$PH
  
  # Clean the column names, impute the missing predictors with MICE,
  # and keep the first of the five completed data sets
  data <- data %>%
    select(-PH) %>%
    mutate(`Brand Code` = as.factor(`Brand Code`)) %>%
    rename_all(~clean.cols(.)) %>%
    mice::mice(m=5, maxit=10, seed=123, printFlag=FALSE) %>%
    mice::complete(1)
  
  # One-hot encode the brand code factor with caret's dummyVars
  x <- predict(dummyVars("~ .", data=data), newdata=data) %>%
    as_tibble()
  
  return(cbind(ph, x))
}

student.data.complete <- pipeline(student.data)
student.evaluation.complete <- pipeline(student.evaluation)
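
Because MICE is iterative, it is worth checking that the imputations have in fact stabilized. The snippet below is an optional diagnostic, not part of our pipeline: it reruns the imputation on the training predictors and plots the mean and standard deviation of the imputed values at each iteration, where flat, well-mixed chains suggest convergence.

# Optional diagnostic sketch: inspect MICE convergence directly
imp.check <- student.data %>%
  select(-PH) %>%
  mutate(`Brand Code` = as.factor(`Brand Code`)) %>%
  rename_all(~clean.cols(.)) %>%
  mice::mice(m=5, maxit=10, seed=123, printFlag=FALSE)

plot(imp.check)  # chain means and SDs per iteration for each imputed variable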

Finally, we confirm that all of the missing predictor values have been imputed. Note that the ph column of the evaluation set remains empty by design; these are the values our model will predict.

missing.data.plot(student.data.complete)

## 
##  Variables sorted by number of missings: 
##          Variable Count
##                ph     0
##       brandcode.A     0
##       brandcode.B     0
##       brandcode.C     0
##       brandcode.D     0
##        carbvolume     0
##        fillounces     0
##          pcvolume     0
##      carbpressure     0
##          carbtemp     0
##               psc     0
##           pscfill     0
##            pscco2     0
##           mnfflow     0
##     carbpressure1     0
##      fillpressure     0
##      hydpressure1     0
##      hydpressure2     0
##      hydpressure3     0
##      hydpressure4     0
##       fillerlevel     0
##       fillerspeed     0
##       temperature     0
##         usagecont     0
##          carbflow     0
##           density     0
##               mfr     0
##           balling     0
##    pressurevacuum     0
##      oxygenfiller     0
##      bowlsetpoint     0
##  pressuresetpoint     0
##      airpressurer     0
##           alchrel     0
##           carbrel     0
##        ballinglvl     0
missing.data.plot(student.evaluation.complete)

## 
##  Variables sorted by number of missings: 
##          Variable Count
##                ph   267
##       brandcode.A     0
##       brandcode.B     0
##       brandcode.C     0
##       brandcode.D     0
##        carbvolume     0
##        fillounces     0
##          pcvolume     0
##      carbpressure     0
##          carbtemp     0
##               psc     0
##           pscfill     0
##            pscco2     0
##           mnfflow     0
##     carbpressure1     0
##      fillpressure     0
##      hydpressure1     0
##      hydpressure2     0
##      hydpressure3     0
##      hydpressure4     0
##       fillerlevel     0
##       fillerspeed     0
##       temperature     0
##         usagecont     0
##          carbflow     0
##           density     0
##               mfr     0
##           balling     0
##    pressurevacuum     0
##      oxygenfiller     0
##      bowlsetpoint     0
##  pressuresetpoint     0
##      airpressurer     0
##           alchrel     0
##           carbrel     0
##        ballinglvl     0

With the data properly cleaned, we can now perform exploratory data analysis.

EDA

After cleaning the data, we performed exploratory data analysis to better understand the data set we are working with. Our data set consists of 36 variables. One is our target variable, ph. Four variables are one-hot encodings of the brand; for regression models we use three of these, with brand A as the base class. Based on Cook's distance, we removed 8 observations. For our regression-based models, this improved RMSE by \(\approx .003\). Sample 2354 had the highest Cook's distance, \(\approx .1\).
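
For reference, Cook's distance for observation \(i\) measures how much every fitted value changes when that observation is dropped from the fit:

\[ D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,s^2} \]

where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) from the model refit without observation \(i\), \(p\) is the number of coefficients, and \(s^2\) is the model's mean squared error.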

multi.model <- student.data.complete %>%
  select(-2) %>%  # drop brandcode.A to serve as the base class
  filter(!row_number() %in% c(2354,2082,690,1896,1841,475,2562,2149)) %>%  # high Cook's distance samples
  lm(data=., ph~.)

multi.model %>%
  influencePlot(id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

##         StudRes         Hat       CookD
## 367  -4.0520330 0.008196838 0.003853476
## 493   2.5018604 0.052924672 0.009973036
## 1051  0.6497145 0.117639071 0.001608355
## 1382  2.0121671 0.085853058 0.010851139
## 1660 -0.1486110 0.164553254 0.000124334
## 2550 -3.9046434 0.012993535 0.005702387

Within our model, 13 variables were significant at the 0.001 level: brandcode.B, mnfflow, carbpressure1, hydpressure3, temperature, usagecont, density, balling, pressurevacuum, oxygenfiller, bowlsetpoint, pressuresetpoint, and ballinglvl. An additional 6 variables were significant at the 0.05 level.

multi.model %>%
  summary()
## 
## Call:
## lm(formula = ph ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52781 -0.07668  0.01003  0.08698  0.41612 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.117e+01  1.077e+00  10.379  < 2e-16 ***
## brandcode.B       8.871e-02  2.131e-02   4.163 3.25e-05 ***
## brandcode.C      -4.207e-02  2.113e-02  -1.991  0.04662 *  
## brandcode.D       5.192e-02  1.750e-02   2.967  0.00304 ** 
## carbvolume       -1.461e-01  8.944e-02  -1.633  0.10250    
## fillounces       -8.211e-02  3.201e-02  -2.565  0.01037 *  
## pcvolume         -1.037e-01  5.387e-02  -1.924  0.05441 .  
## carbpressure      3.096e-03  4.177e-03   0.741  0.45867    
## carbtemp         -1.668e-03  3.294e-03  -0.506  0.61257    
## psc              -7.863e-02  5.626e-02  -1.397  0.16240    
## pscfill          -3.531e-02  2.309e-02  -1.529  0.12643    
## pscco2           -1.224e-01  6.248e-02  -1.959  0.05017 .  
## mnfflow          -7.495e-04  4.592e-05 -16.320  < 2e-16 ***
## carbpressure1     6.185e-03  6.880e-04   8.990  < 2e-16 ***
## fillpressure      1.332e-03  1.210e-03   1.102  0.27077    
## hydpressure1     -4.086e-05  3.630e-04  -0.113  0.91039    
## hydpressure2     -1.188e-03  5.309e-04  -2.237  0.02535 *  
## hydpressure3      3.581e-03  5.830e-04   6.142 9.44e-10 ***
## hydpressure4     -8.337e-05  3.187e-04  -0.262  0.79366    
## fillerlevel      -8.664e-04  5.605e-04  -1.546  0.12231    
## fillerspeed       1.930e-05  9.941e-06   1.941  0.05231 .  
## temperature      -1.612e-02  2.314e-03  -6.966 4.15e-12 ***
## usagecont        -7.252e-03  1.135e-03  -6.387 2.00e-10 ***
## carbflow          8.946e-06  3.745e-06   2.389  0.01697 *  
## density          -1.181e-01  2.777e-02  -4.252 2.20e-05 ***
## mfr              -7.817e-05  5.576e-05  -1.402  0.16105    
## balling          -1.464e-01  2.605e-02  -5.618 2.15e-08 ***
## pressurevacuum   -3.240e-02  7.885e-03  -4.110 4.09e-05 ***
## oxygenfiller     -3.800e-01  7.231e-02  -5.256 1.60e-07 ***
## bowlsetpoint      2.861e-03  5.905e-04   4.845 1.34e-06 ***
## pressuresetpoint -7.750e-03  1.962e-03  -3.950 8.02e-05 ***
## airpressurer     -1.756e-03  2.343e-03  -0.749  0.45364    
## alchrel           8.013e-02  2.658e-02   3.014  0.00260 ** 
## carbrel           4.685e-02  4.802e-02   0.976  0.32933    
## ballinglvl        1.852e-01  2.628e-02   7.049 2.32e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1312 on 2523 degrees of freedom
## Multiple R-squared:  0.4221, Adjusted R-squared:  0.4143 
## F-statistic: 54.21 on 34 and 2523 DF,  p-value: < 2.2e-16
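
These counts can be verified directly from the fitted model object; a quick sketch:

# Tally coefficient significance levels from the summary table above
coefs <- summary(multi.model)$coefficients
sum(coefs[-1, 'Pr(>|t|)'] < 0.001)                                  # 13 at the 0.001 level
sum(coefs[-1, 'Pr(>|t|)'] >= 0.001 & coefs[-1, 'Pr(>|t|)'] < 0.05)  # 6 more at the 0.05 level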

We found strong correlations (both positive and negative) among a variety of features. In the correlation plot below, stripes of dark color are visible along balling and ballinglvl.

student.data.complete %>%
  cor(use='complete.obs') %>%
  corrplot(method='color', type='upper')

Because several variables are highly correlated with one another, we checked for multicollinearity. Many of our variables had high variance inflation factors (VIF). To remove this concern in our regression models, we iteratively dropped variables until every remaining variable had a VIF below 10; a sketch of that pruning loop follows the plot below.

student.data.complete %>%
  select(c(1:2, 4:36)) %>%  # drop one brand dummy to avoid aliased coefficients in vif()
  lm(data=., ph~.) %>%
  vif() %>%
  tibble('variable' = names(.), 'variable_vif' = .) %>%
  filter(variable_vif > 10) %>%
  ggplot(aes(variable, variable_vif)) + 
  geom_bar(stat='identity', fill='#b5c6fc') + 
  theme(panel.background = element_rect(fill = '#707996'),
        text = element_text(family = 'corben', color='#249382', size=35),
        axis.text.x = element_text(angle = 30, hjust = .9)
        ) + 
  labs(x = 'Variable', 
       y = 'VIF',
       title = 'Variables with the highest vif')
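
Below is a minimal sketch of the pruning loop itself. The helper prune.vif is our own name, not a package function; it assumes car is loaded (as above) and that one brand dummy has already been dropped.

# Sketch of the iterative VIF pruning described above: refit, drop the
# highest-VIF variable, and repeat until all VIFs fall below the threshold
prune.vif <- function(data, threshold = 10){
  repeat {
    scores <- vif(lm(ph ~ ., data = data))
    if (max(scores) < threshold) break
    worst <- names(which.max(scores))  # variable with the largest VIF
    data <- data %>% select(-all_of(worst))
  }
  return(data)
}

pruned.data <- student.data.complete %>%
  select(c(1:2, 4:36)) %>%
  prune.vif()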

Below we plot each feature against the response variable PH to inspect their relationships.

student.data.complete %>%
  gather(key='key', value='value', -ph) %>%
  ggplot(aes(value, ph)) + 
  geom_point(alpha=.9,color='#65b285') + 
  facet_wrap(~key, scales='free_x', ncol=4) +
  labs(y='PH')

Model Development

We began our search for a predictive model by separating out 20% of the training data to serve as our final validation set. We also split each data frame into a list holding the response ('y') and the predictors ('x') for easier use when testing various predictive models.

set.seed(123)

listify <- function(data){
  return(list('y' = data$ph, 'x' = data %>% select(-ph)))
}

part <- createDataPartition(student.data.complete$ph, p=0.8, list=FALSE)
training <- student.data.complete %>%
  filter(row_number() %in% part)
validation <- student.data.complete %>%
  filter(!row_number() %in% part)

student.data.complete <- listify(student.data.complete)
student.evaluation.complete <- listify(student.evaluation.complete)
training <- listify(training)
validation <- listify(validation)

We iterated over numerous predictive models, including a variety of regression-based and rule-based models. The code for all of these tests is available in our repo. We ultimately settled on a Random Forest model because it achieved the best performance in our cross-validated testing. We assessed the performance of each model by its root mean square error (RMSE). The tabs below show a selected set of the models we trained.
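
For reference, the RMSE over \(n\) predictions is

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]

where \(y_i\) is the observed PH and \(\hat{y}_i\) is the model's prediction; lower is better, and the value is in the same units as PH.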

The selected Random Forest model achieved an RMSE of \(\approx 0.1003\) and an \(R^2 \approx 0.65\) in our testing (details in the Random Forest tab below).

XGBoost

grid <- expand.grid(nrounds = 2500,
                    max_depth = 6,
                    eta = 0.03,
                    gamma = 0,
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    subsample = 0.5)

control <- trainControl(method='cv',
                        number=10,
                        allowParallel = TRUE)

model <- train(x = training$x,
             y = training$y,
             method = 'xgbTree',
             tuneGrid = grid,
             metric = 'RMSE',
             trControl = control)

model$results
##   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 1    2500         6 0.03     0                1                1       0.5
##        RMSE  Rsquared        MAE      RMSESD RsquaredSD      MAESD
## 1 0.1019613 0.6476099 0.07464376 0.008630701 0.05637404 0.00453064

The best XGBoost model performed nearly as well as the Random Forest model and would serve as a good alternative, if needed.

Support Vector Machine

set.seed(200)
svmRTuned <- train(x = training$x,
                   y = training$y,
                   method="svmRadial", 
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2054 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1849, 1849, 1850, 1848, 1848, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1261216  0.4716002  0.09403762
##      0.50  0.1230421  0.4937594  0.09065211
##      1.00  0.1204268  0.5136178  0.08796257
##      2.00  0.1178681  0.5333699  0.08586277
##      4.00  0.1162084  0.5456875  0.08497896
##      8.00  0.1155904  0.5517457  0.08455720
##     16.00  0.1165009  0.5493467  0.08601967
##     32.00  0.1195934  0.5343728  0.08895020
##     64.00  0.1253487  0.5068460  0.09356035
##    128.00  0.1313385  0.4824173  0.09843447
##    256.00  0.1376892  0.4567386  0.10282025
##    512.00  0.1429095  0.4361877  0.10661876
##   1024.00  0.1434758  0.4346814  0.10693952
##   2048.00  0.1434758  0.4346814  0.10693952
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02056593
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02056593 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 8 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0205659307622407 
## 
## Number of Support Vectors : 1731 
## 
## Objective Function Value : -3582.52 
## Training error : 0.155419

Elastic Net

training_set_matrix <- as.matrix(training$x)

# lambda.1se is only defined on a cross-validated fit, so we tune lambda with
# cv.glmnet; [,-1] drops the first brand dummy to serve as the base class
set.seed(200)
elastic_net_model <- cv.glmnet(training_set_matrix[,-1], training$y,
                               family="gaussian", alpha=.65, standardize=TRUE)

elnet_predict <- predict(elastic_net_model, s=elastic_net_model$lambda.1se,
                         newx=training_set_matrix[,-1])
RMSE(elnet_predict, training$y)
## [1] 0.1386991

Random Forest

set.seed(200)
rf <- randomForest(x = training$x,
                   y = training$y, 
                   importance=TRUE,
                   ntree=1000)
mean(sqrt(rf$mse))  # average out-of-bag RMSE
## [1] 0.1003276
mean(rf$rsq)  # average out-of-bag pseudo R-squared
## [1] 0.6572274

Below we can see the order of importance of each predictor in determining the response variable. The table also indicates that a number of predictors could likely be dropped from the model with little loss in predictive value; a hypothetical refit along those lines is sketched after the table. This may be an interesting avenue of exploration if we wanted to prioritize a model that can be easily interpreted. As it stands, while this model is highly accurate, it is not easy to interpret.

rf$importance %>%
  as.data.frame() %>%
  rownames_to_column(var='predictor') %>%
  arrange(desc(`%IncMSE`)) %>%
  datatable()
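
As a rough illustration of that idea (a sketch only; the cutoff of ten predictors is arbitrary), one could keep the predictors with the highest %IncMSE and refit the forest:

# Hypothetical sketch: refit using only the ten most important predictors
top.predictors <- rf$importance %>%
  as.data.frame() %>%
  rownames_to_column(var='predictor') %>%
  arrange(desc(`%IncMSE`)) %>%
  slice(1:10) %>%
  pull(predictor)

rf.small <- randomForest(x = training$x %>% select(all_of(top.predictors)),
                         y = training$y,
                         ntree=1000)
mean(sqrt(rf.small$mse))  # compare against the full model's OOB RMSE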

With the model selected, we finally run it against the withheld validation data. This gives us a sense of how well the model will perform on data it has never seen before. Our model scores a strong RMSE of approximately 0.0975 with an \(R^2\) of approximately 0.70.

postResample(predict(rf, newdata=validation$x), validation$y)
##       RMSE   Rsquared        MAE 
## 0.09753956 0.70032732 0.07186069

With the model trained and validated, we can make our final predictions.

Predictions

Before making our final predictions, we retrain the Random Forest model one last time. The model is retrained with the same hyper-parameters, but this time on ALL of the training data, including the previously withheld validation data. This step is often considered optional when creating a predictive model. Folding the validation data back in, while leaving everything else the same, can allow the model to eke out a bit more predictive ability thanks to the extra samples. However, the improvement is often very small (if any), so this step is frequently skipped when the training process is long. In our case, as the data set is small, we believe there is no downside to retraining the model.

final.rf <- randomForest(x = student.data.complete$x,
                         y = student.data.complete$y, 
                         importance=TRUE,
                         ntree=1000)

With the final model trained, we can finally make our predictions on the provided data.

predictions <- predict(final.rf, newdata=student.evaluation.complete$x)
predictions %>%
  tibble::enframe(name = NULL) %>%
  datatable()

Finally, as requested, we can write out the file in an Excel readable format.

predictions %>%
  tibble::enframe(name = NULL) %>%
  rownames_to_column() %>%
  rename(PH = value, 
         row = rowname) %>%
  write_excel_csv('./data/predictions.csv')