This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Data Exploration

training<-read_xlsx("StudentData.xlsx")
testing<-read_xlsx("StudentEvaluation.xlsx") %>% 
  mutate(PH = as.numeric(PH))

An initial glance at this dataset shows that our training set consists of a dimension of 2571 x 33 (rows x columns). Most of the features within this dataset are of a numeric nature with just 1 feature Brand Code being a categorical variable.

skim(training)
Data summary
Name training
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
character 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 <U+2581><U+2586><U+2587><U+2585><U+2581>
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 <U+2581><U+2582><U+2587><U+2582><U+2581>
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 <U+2581><U+2583><U+2587><U+2582><U+2581>
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 <U+2581><U+2585><U+2587><U+2583><U+2581>
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 <U+2581><U+2585><U+2587><U+2583><U+2581>
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 <U+2586><U+2587><U+2583><U+2581><U+2581>
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 <U+2586><U+2587><U+2583><U+2581><U+2581>
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 <U+2587><U+2585><U+2582><U+2581><U+2581>
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 <U+2587><U+2581><U+2581><U+2587><U+2582>
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 <U+2581><U+2583><U+2587><U+2582><U+2581>
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 <U+2581><U+2581><U+2587><U+2582><U+2581>
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 <U+2587><U+2585><U+2582><U+2581><U+2581>
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 <U+2587><U+2582><U+2587><U+2585><U+2581>
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 <U+2587><U+2581><U+2583><U+2587><U+2581>
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 <U+2581><U+2583><U+2587><U+2582><U+2581>
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 <U+2581><U+2583><U+2585><U+2587><U+2581>
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 <U+2581><U+2581><U+2581><U+2581><U+2587>
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 <U+2587><U+2583><U+2581><U+2581><U+2581>
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 <U+2581><U+2583><U+2585><U+2583><U+2587>
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 <U+2582><U+2585><U+2586><U+2587><U+2581>
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 <U+2581><U+2585><U+2587><U+2582><U+2586>
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 <U+2581><U+2581><U+2581><U+2582><U+2587>
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 <U+2581><U+2587><U+2587><U+2581><U+2587>
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 <U+2582><U+2587><U+2586><U+2582><U+2581>
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 <U+2581><U+2585><U+2587><U+2582><U+2581>
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 <U+2587><U+2581><U+2581><U+2581><U+2581>
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 <U+2581><U+2582><U+2583><U+2587><U+2581>
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 <U+2581><U+2587><U+2581><U+2586><U+2581>
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 <U+2585><U+2587><U+2581><U+2581><U+2581>
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 <U+2581><U+2587><U+2582><U+2583><U+2581>
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 <U+2581><U+2587><U+2587><U+2582><U+2581>
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 <U+2581><U+2587><U+2582><U+2581><U+2586>

The figure below shows the distribution of the brand codes associated with the dataset, where we observe many a skewness towards brand B and a few unknown data points.

plot_bar(training)

In terms of numeric data, we can plot the histogram below with the DataExplorer package and assess the distributions of all of our features including pH. The distribution of these features do differ. pH follows a gaussian distribution which is beneficial for predictions. Many features have slight skewness or bimodal gaussian distributions. We will handle these features via boxcox transformations.

plot_histogram(training,ncol  = 3)

Additionally, we want to look at multicolinearity associated with the dataset. because the number of columns is quite large, we will only produce a scatter matrix using features with the highest pearson (linear) correlations. These relationships,you can see there are some strong relationships such as between filler speed vs MFR, and Balling vs Balling level, these features will need to be assessed and potentially aggregated or removed to mitigate noise in the dataset.

trainingnumeric<-training %>% 
  select(-`Brand Code`)
ADHD_cors<-
  bind_rows(
    #pearson correlation of numeric features
    trainingnumeric %>% 
      cor(method = "pearson", use = "pairwise.complete.obs") %>% as.data.frame() %>% 
      rownames_to_column(var = "x") %>% 
      pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>% 
      mutate(cor_type = "pearson"),
    #spearman (monotonic) correlations of numeric features
    trainingnumeric %>% 
      cor(method = "spearman", use = "pairwise.complete.obs") %>% as.data.frame() %>% 
      rownames_to_column(var = "x") %>% 
      pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>% 
      mutate(cor_type = "spearman")
  )
topcors<-ADHD_cors %>% 
  filter(!(x ==y)) %>% 
  filter(cor_type=="pearson") %>% 
  #distinct(correlation,.keep_all = T) %>% 
  arrange(-abs(correlation))%>% #top_n(10, correlation) %>% 
  head(10)
  
topcorvars<-c(pull(topcors,x),pull(topcors,y)) %>% unique
topcorset<-training %>% 
  select(all_of(c(topcorvars,'Brand Code')))
GGally::ggpairs(topcorset, progress = F, aes(color = `Brand Code`), title = "Top Correlations Within Training Set")+ theme_bw()

Since pH will be our main variable of interest, it is a good idea to observe any strong relationships between our predictor variables to pH, this time we will look at top spearman relationships to pH only. spearman provides a correlational strength of monotonicity which can capture both linear and non-linear relationships. We notice based on the figure below that most of the information seems a bit noisy, thus using a predictive machine learning model may provide value in identifying relationships.

topcorsPH<-ADHD_cors %>% 
  filter(!(x ==y)) %>% 
  filter(cor_type=="spearman") %>% 
  #distinct(correlation,.keep_all = T) %>% 
  arrange(-abs(correlation))%>%
  filter(x == "PH" | y == "PH") %>% 
  head(10)
topcorvarsPH<-c(pull(topcorsPH,x),pull(topcorsPH,y)) %>% unique
topcorsetPH<-training %>% 
  select(all_of(c(topcorvarsPH,'Brand Code')))
GGally::ggpairs(topcorsetPH, progress = F, aes(color = `Brand Code`), title = "Top Correlations to pH Within Training Set")+ theme_bw()

Data Transformation

datatrans<-training %>% recipe(~.) %>% 
  step_mutate_at(contains(c("hyd pressure", "mnf flow","filler speed","carb flow", "balling")), fn = ~ifelse(.<0,NA,.)) %>% 
  step_mutate_at(`Carb Volume`,`Filler Level`,PSC, `Air Pressurer`, `Bowl Setpoint`,`Oxygen Filler`,Temperature,MFR, fn = ~BoxCox(., lambda = BoxCox.lambda(.))) %>%
  step_mutate_at(`Brand Code`, fn = ~as.factor(.)) %>% 
  step_impute_knn(all_predictors()) %>% 
  step_rename(BrandCode=`Brand Code`) %>% 
  step_dummy(all_nominal()) %>% 
  step_corr(all_predictors(),threshold = 0.9) %>% 
  step_nzv(all_predictors()) %>% 
  prep()

training_cleaned<-
datatrans %>% bake(training)

For the Data preparation following baseline steps were made and the reasons are provided below

datatrans$steps[[7]]$removals
[1] "Hyd Pressure3" "Balling"       "Bowl Setpoint" "Alch Rel"     
[5] "Balling Lvl"  
datatrans$steps[[8]]$removals
character(0)

Build Models

A series of regression models were built to predict pH values in the bottling process. Models were built on the training dataset using the caret package with 10-fold cross validation to determine optimal values of tuning parameters and were assessed by quantitative comparison of accuracy metrics Root Mean Square Error (RMSE), Mean Average Error (MAE), and R squared. By tuning these hyperparameters, a balance in the bias-variance trad-off can be reached. In addition to making predictions, the variables that most significantly influence pH are identified so that future decisions regarding bottle process streamlining can be effectively made.

The training dataset contains 2751 observations of 32 predictor variables and the target, pH. The test set contains 627 observations. Values of all variables were centered and scaled to facilitate model algorithms and allow for direct comparison of variable importance to a given model.

Models that were considered for this analysis include:

Partial Least Squares

A Partial Least Squares (PLS) model was built due to correlations ween in the predictor variables. The model was tuned across 10 principle components (ncomp) from 1 to 10. ncomp was computed using cross-validation to determine the value that minimized RMSE, and was determined to be ncomp=7.

#  tuning paramter ncomp; number of components
set.seed(0515)
pls_mod <- train(x=trainX, y=trainY$PH,
      method="pls",
      tuneLength=10, 
      trControl=ctrl)

# plot model tuning
pls_mod %>% 
  ggplot() + 
      ggtitle("Results of PLS cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# make predictions on test set data
pls_pred <- predict(pls_mod, newdata = testX) %>% 
  as.data.frame()

# get accuracy metrics of test set data 
pls_acc <- postResample(pred=pls_pred, obs=testY)

K-Nearest Neighbors

K-nearest neighbors (KNN) regression models predict a new sample using the K-closest samples from the training set. For the model built, a Euclidean distance was used and optimal value for K, the number of neighbors used for each calculation, was determined from cross validation varying 8 (odd numbers from 5 to 19) values of K. The value of K that minimized RMSE was K=7, and was used as the tuning parameter in the model.

# train model; tuning parameter k, number of neighbors
set.seed(111)
knn_mod <- train(x=trainX, y=trainY$PH,
                  method="knn",
                  tuneLength=8,
                  trControl=ctrl)

knn_mod %>% 
  ggplot() + 
      ggtitle("Results of KNN cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# make predictions on test set data
knn_pred <- predict(knn_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data 
knn_acc <- postResample(pred=knn_pred, obs=testY)

Support Vector Machines

Support Vector Machines (SVM) regression model was tested using a linear (not shown) and radial kernel function. The radial kernel outperformed the linear. Cross validation was used to test varying values of Cost (C) which is a term that penalizes large residuals, and sigma which sets a threshold distance that disqualifies points from being included in the distance calculation of a given support vector. The values of C and sigma included a matrix of all pair-wise combinations of C=0.01, 0.1, 1, 2 and sigma= 0.01, 0.1. The combination that minimized RMSE was sigma = 0.1 and C = 2.

# train model 
set.seed(2021)
grid <- expand.grid(C=c(0.01, 0.1, 1, 2),sigma=c(0.01, 0.1))

svm_mod <- train(x=trainX, y=trainY$PH,
                  method="svmRadial",
                  tuneLength=10, 
                  trControl=ctrl,
                 tuneGrid=grid)

# plot
svm_mod %>% 
  ggplot() + 
      ggtitle("Results of SVM cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# make predictions on test set data
svm_pred <- predict(svm_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data 
svm_acc <- postResample(pred=svm_pred, obs=testY)

Neural Network

For the Neural NEtwork (nnet) model. The hyperparameter size refers to the number of hidden layers of the neural net. Weight decay is a term used to regularize the model, penalizing large coefficients. Varying values for size and decay were tested in the model training process and included all pair-wise combinations of decay=0.001, 0.01, 0.1 and size= integers from 1 to 10. Highly correlated predictors were filtered prior to the model run since nnet models are highly sensitive.

Tuning parameters: size = 9, decay = 0.01 and bag = FALSE.

# remove high correlated variables
tooHigh <- findCorrelation(cor(trainX), cutoff=.75)
trainXnnet <- trainX[,-tooHigh]
testXnnet <- testX[,-tooHigh]

# Grid of hyperparameters
nnetGrid <- expand.grid(.decay=c(0.001, 0.01,.1),
                        .size=c(1:10),
                        .bag=F)

# run model
nnet_mod <- train(x=trainXnnet, y=trainY$PH,
                  method="avNNet",
                  tuneGrid=nnetGrid,
                  trControl=ctrl,
                  linout=T,
                  trace=F,
                  MaxNWts = 10 * (ncol(training_cleaned)+1) + 10 +1,
                  maxit=500
                  )

# plot
nnet_mod %>% 
  ggplot() + 
      ggtitle("Results of nnet cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# predict
nnet_pred <- predict(nnet_mod$finalModel, newdata = testXnnet) %>% as.data.frame()
nnet_acc <- postResample(nnet_pred, obs=testY)

MARS

Multivariate adaptive regression splines (MARS) is a non-parametric algorithm that creates a piece-wise linear model to capture nonlinearities and interactions effects. A MARS model was fit by varying hyperparameters degree(maximum interaction term) and nprune(maximum terms in final model). The final values used for the model were nprune = 21 and degree = 3.

# create MARS grid for hyperparameters
marsGrid <- expand.grid(.degree=1:3, .nprune=4:25)

# train model
mars_mod <- train(x=trainX, y=trainY$PH,
                   method="earth",
                   tuneGrid = marsGrid,
                   trControl=ctrl
                   )
#plot
mars_mod %>% 
  ggplot() + 
      ggtitle("Results of MARS cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# predict
mars_pred <- predict(mars_mod, newdata=testX) %>% data.frame()
mars_acc <- postResample(mars_pred, obs=testY)

Random Forest

The final model built was a random forest. Cross-validation was used to optimize the number of predictors used as candidates at each iteration split. The final value selected for the model was mtry=11

set.seed(1984)


rf_mod <- train(x=trainX, y=trainY$PH,
      method="rf",
      tuneLength = 10,
      trControl = ctrl)

#plot
rf_mod %>% 
  ggplot() + 
      ggtitle("Results of random forest cross validation") + 
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))

# predict
rf_pred <- predict(rf_mod, newdata=testX) %>% data.frame()
rf_acc <- postResample(rf_pred, obs=testY)

Model Selection

The six candidate models were then evaluated based on accuracy metrix on the test set, data unseen to the model, and therefore indicative of predictive power. Accuracy metrics evaluated include RMSE, R-squared, and MAE, and are provided in tabular and graphical form.

Table of Results

table <- data.frame("PLS" = pls_acc %>% round(3),
                    "KNN" = knn_acc %>% round(3),
                    "SVM" = svm_acc %>% round(3) , 
                    "nNet" = nnet_acc %>% round(3), 
                    "MARS" = mars_acc %>% round(3), 
                    "RF" = rf_acc %>% round(3)) %>%  
  t %>% 
  data.frame

table$tuning_parameter <- c("ncomp", "k", "signma, C",  "size, decay", "nprune, degree", "mytry"  ) 

table$value <- c("7", "7", "0.1, 2", "9, 0.01", "21, 3", "11")     

table 

Barplot of results

table$model <- rownames(table)
table %>% 
  select(RMSE, Rsquared, MAE,model) %>% 
  pivot_longer(c(RMSE,Rsquared,MAE)) %>%
  arrange(value) %>% 
  ggplot(aes(x=model,y=value))+ 
  geom_bar(stat="identity", col="black",fill="dodgerblue", alpha=.5)+
  geom_text(aes(label=value),vjust=1.5, color="white")+
  facet_wrap(~name, nrow=3, scales = "free_y") +
      theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black")
       )

Based on the performance of models on the test set, the random forest model was selected as the best predictor of pH in the bottling process. Predictions made using the training and tuned random forest model are provided in the included Excel document. Finally, calculations of variable importance to the random forest model are provided so future decisions regarding bottling procedures can targeted in ways that effectively manage pH levels.

# predictions 
#predictions <- predict(rf_mod, testX)
# cbind(predictions,testing) %>% 
#   write.csv('predictions.csv')
varImp(rf_mod)%>% 
  plot