This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
# Packages used for data import and exploration
library(readxl); library(tidyverse); library(skimr); library(DataExplorer)
training <- read_xlsx("StudentData.xlsx")
testing <- read_xlsx("StudentEvaluation.xlsx") %>%
  mutate(PH = as.numeric(PH))  # PH is the response; it is blank in the evaluation file
An initial glance at the data shows that the training set has dimensions of 2571 x 33 (rows x columns). Most of the features are numeric; only one, Brand Code, is categorical.
skim(training)
Name | training |
Number of rows | 2571 |
Number of columns | 33 |
Column type frequency: | |
character | 1 |
numeric | 32 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
The figure below shows the distribution of the brand codes in the dataset, where we observe a strong skew toward brand B as well as a number of observations with an unknown (missing) brand code.
plot_bar(training)
For the numeric data, we can plot the histograms below with the DataExplorer package and assess the distributions of all of our features, including pH. The distributions differ considerably. pH follows an approximately Gaussian distribution, which is beneficial for prediction. Many features show mild skewness or bimodal shapes; we will address these via Box-Cox transformations (a brief illustration follows the histograms).
plot_histogram(training,ncol = 3)
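As a brief illustration of what the Box-Cox step (applied inside the preprocessing recipe further below) does to a single skewed feature, the sketch below estimates a transformation parameter for Temperature with the forecast package. The choice of Temperature here is purely illustrative; the actual transformations are handled in the recipe.
# Illustration only: estimate and apply a Box-Cox transformation to one
# right-skewed, strictly positive feature (Temperature). The preprocessing
# recipe below performs the same operation across several features.
library(forecast)
temp <- na.omit(training$Temperature)
lambda <- BoxCox.lambda(temp)    # data-driven estimate of the transformation parameter
temp_bc <- BoxCox(temp, lambda)  # transformed values, closer to a Gaussian shape
hist(temp_bc, main = "Temperature after Box-Cox transformation", xlab = "Transformed value")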
Additionally, we want to examine multicollinearity within the dataset. Because the number of columns is large, we only produce a scatter matrix for the features with the highest Pearson (linear) correlations. From these relationships you can see some strong associations, such as Filler Speed vs. MFR and Balling vs. Balling Lvl; these features will need to be assessed and potentially aggregated or removed to mitigate redundancy in the dataset.
trainingnumeric<-training %>%
select(-`Brand Code`)
training_cors <-
bind_rows(
#pearson correlation of numeric features
trainingnumeric %>%
cor(method = "pearson", use = "pairwise.complete.obs") %>% as.data.frame() %>%
rownames_to_column(var = "x") %>%
pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>%
mutate(cor_type = "pearson"),
#spearman (monotonic) correlations of numeric features
trainingnumeric %>%
cor(method = "spearman", use = "pairwise.complete.obs") %>% as.data.frame() %>%
rownames_to_column(var = "x") %>%
pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>%
mutate(cor_type = "spearman")
)
topcors <- training_cors %>%
  filter(x != y) %>%
  filter(cor_type == "pearson") %>%
  arrange(-abs(correlation)) %>%
  head(10)
topcorvars<-c(pull(topcors,x),pull(topcors,y)) %>% unique
topcorset<-training %>%
select(all_of(c(topcorvars,'Brand Code')))
GGally::ggpairs(topcorset, progress = F, aes(color = `Brand Code`), title = "Top Correlations Within Training Set")+ theme_bw()
Since pH is our main variable of interest, it is a good idea to examine strong relationships between the predictor variables and pH; this time we look at the top Spearman correlations with pH only. Spearman correlation measures the strength of a monotonic relationship, so it can capture both linear and non-linear associations. Based on the figure below, most of these relationships appear fairly noisy, so a predictive machine learning model may add value in identifying the underlying relationships.
topcorsPH <- training_cors %>%
  filter(x != y) %>%
  filter(cor_type == "spearman") %>%
  arrange(-abs(correlation)) %>%
  filter(x == "PH" | y == "PH") %>%
  head(10)
topcorvarsPH<-c(pull(topcorsPH,x),pull(topcorsPH,y)) %>% unique
topcorsetPH<-training %>%
select(all_of(c(topcorvarsPH,'Brand Code')))
GGally::ggpairs(topcorsetPH, progress = F, aes(color = `Brand Code`), title = "Top Correlations to pH Within Training Set")+ theme_bw()
# Preprocessing recipe; uses the recipes package plus forecast::BoxCox()
library(recipes); library(forecast)
datatrans <- training %>% recipe(~.) %>%
  # set physically impossible negative readings to missing
  step_mutate_at(contains(c("hyd pressure", "mnf flow", "filler speed", "carb flow", "balling")), fn = ~ifelse(. < 0, NA, .)) %>%
  # Box-Cox transform skewed features toward a Gaussian shape
  step_mutate_at(`Carb Volume`, `Filler Level`, PSC, `Air Pressurer`, `Bowl Setpoint`, `Oxygen Filler`, Temperature, MFR, fn = ~BoxCox(., lambda = BoxCox.lambda(.))) %>%
  step_mutate_at(`Brand Code`, fn = ~as.factor(.)) %>%
  # impute remaining missing values with k-nearest neighbors
  step_impute_knn(all_predictors()) %>%
  step_rename(BrandCode = `Brand Code`) %>%
  # expand categorical variables into 0/1 indicator columns
  step_dummy(all_nominal()) %>%
  # drop highly correlated (|r| > 0.9) and near-zero-variance predictors
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_nzv(all_predictors()) %>%
  prep()
training_cleaned<-
datatrans %>% bake(training)
For data preparation, the following baseline steps were taken, with the reasoning provided below.
Remove observations with missing pH: Because pH is the response variable in this project, we omit any observations that do not have a pH value (if any).
Removing negative values: Negative readings were set to missing (and later imputed) for features where negative values are not physically meaningful, such as the Hyd Pressure variables, Mnf Flow, Filler Speed, Carb Flow, and Balling.
Box-Cox transformations: Skewed features were transformed with Box-Cox so that their distributions are closer to Gaussian. Together with the centering and scaling applied before modeling (mean 0, standard deviation 1), this places features on comparable scales, which matters for distance-based procedures such as the k-nearest-neighbors imputation below, since it keeps variables with large ranges from dominating the distance calculations.
Imputation of missing data with KNN: The remaining missing values were imputed with k-nearest neighbors (KNN). Alternative methods include median, mean, or bagged-tree imputation, but KNN was judged to provide the best results with minimal additional computational effort.
Dummifying variables: Categorical variables were converted into 0/1 indicator columns. This is particularly important for algorithms such as linear regression that cannot handle categorical inputs directly. For example, a feature with categories 1, 2, and 3 is expanded so that category 3 becomes its own column coded 0/1 for absence/presence (a quick check of this encoding is shown after this list of steps).
Collinearity test: Pairwise collinearity was assessed using Pearson correlation with a threshold of 0.9, and a few features were removed as a result. The features removed due to collinearity are shown below.
datatrans$steps[[7]]$removals
[1] "Hyd Pressure3" "Balling" "Bowl Setpoint" "Alch Rel"
[5] "Balling Lvl"
Near-zero variance: Based on the Data Exploration section, there do not appear to be any clear low-variance features in which the large majority of observations take a single value. This is confirmed statistically with tidymodels, where no features were removed due to near-zero variance, as shown below.
datatrans$steps[[8]]$removals
character(0)
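As a quick check on the imputation and dummy-encoding steps described above (a sketch that relies only on the training_cleaned object produced by the recipe), the baked data should contain no remaining missing values, and Brand Code should now appear as separate 0/1 indicator columns:
# Sanity checks on the baked training data (sketch)
anyNA(training_cleaned)   # expect FALSE after KNN imputation
# Brand Code has been expanded into 0/1 indicator columns by step_dummy()
training_cleaned %>%
  select(starts_with("BrandCode")) %>%
  head()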
A series of regression models was built to predict pH values in the bottling process. Models were built on the training dataset using the caret package with 10-fold cross-validation to determine optimal values of the tuning parameters, and were assessed by quantitative comparison of the accuracy metrics root mean square error (RMSE), mean absolute error (MAE), and R-squared. By tuning these hyperparameters, a balance in the bias-variance trade-off can be reached. In addition to making predictions, the variables that most strongly influence pH are identified so that future decisions regarding bottling-process streamlining can be made effectively.
The full historical dataset contains 2571 observations of 32 predictor variables and the target, pH; a held-out test set of 627 observations was reserved for model evaluation. Values of all variables were centered and scaled to facilitate the model algorithms and allow for direct comparison of variable importance within a given model.
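The objects referenced by the model code that follows (trainX, trainY, testX, testY, and the cross-validation control ctrl) are assumed to have been created along the following lines. This is a sketch only; the split proportion and random seed are illustrative assumptions rather than values taken from the original analysis.
library(caret)
# Sketch of the assumed setup for the objects used by the model code below.
set.seed(2021)
model_data <- training_cleaned %>% filter(!is.na(PH))   # drop rows with missing pH
in_train  <- createDataPartition(model_data$PH, p = 0.75, list = FALSE)
train_set <- model_data[in_train, ]
test_set  <- model_data[-in_train, ]
# Center and scale predictors using statistics estimated on the training split
preproc <- preProcess(select(train_set, -PH), method = c("center", "scale"))
trainX  <- predict(preproc, select(train_set, -PH))
trainY  <- select(train_set, PH)
testX   <- predict(preproc, select(test_set, -PH))
testY   <- test_set$PH
# 10-fold cross-validation used to tune every candidate model
ctrl <- trainControl(method = "cv", number = 10)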
The models considered for this analysis are described in turn below.
A partial least squares (PLS) model was built because of the correlations seen among the predictor variables. The model was tuned over the number of components (ncomp) from 1 to 10, and cross-validation was used to select the value that minimized RMSE, which was ncomp = 7.
# tuning parameter ncomp: the number of PLS components
set.seed(0515)
pls_mod <- train(x=trainX, y=trainY$PH,
method="pls",
tuneLength=10,
trControl=ctrl)
# plot model tuning
pls_mod %>%
ggplot() +
ggtitle("Results of PLS cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
pls_pred <- predict(pls_mod, newdata = testX) %>%
as.data.frame()
# get accuracy metrics of test set data
pls_acc <- postResample(pred=pls_pred, obs=testY)
K-nearest neighbors (KNN) regression predicts a new sample using the K closest samples from the training set. For the model built here, Euclidean distance was used, and the optimal value of K, the number of neighbors used in each prediction, was determined by cross-validation over 8 values of K (odd numbers from 5 to 19). The value of K that minimized RMSE, K = 7, was used as the tuning parameter in the final model.
# train model; tuning parameter k, number of neighbors
set.seed(111)
knn_mod <- train(x=trainX, y=trainY$PH,
method="knn",
tuneLength=8,
trControl=ctrl)
knn_mod %>%
ggplot() +
ggtitle("Results of KNN cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
knn_pred <- predict(knn_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data
knn_acc <- postResample(pred=knn_pred, obs=testY)
A support vector machine (SVM) regression model was tested using both a linear kernel (not shown) and a radial basis function kernel; the radial kernel outperformed the linear one. Cross-validation was used to test varying values of the cost (C), a term that penalizes large residuals, and sigma, which controls the width of the radial kernel and therefore how far a training point's influence extends. The grid of values included all pairwise combinations of C = 0.01, 0.1, 1, 2 and sigma = 0.01, 0.1. The combination that minimized RMSE was sigma = 0.1 and C = 2.
# train model
set.seed(2021)
grid <- expand.grid(C=c(0.01, 0.1, 1, 2),sigma=c(0.01, 0.1))
svm_mod <- train(x=trainX, y=trainY$PH,
method="svmRadial",
tuneLength=10,
trControl=ctrl,
tuneGrid=grid)
# plot
svm_mod %>%
ggplot() +
ggtitle("Results of SVM cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
svm_pred <- predict(svm_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data
svm_acc <- postResample(pred=svm_pred, obs=testY)
For the neural network (avNNet) model, the hyperparameter size refers to the number of units in the hidden layer, and weight decay is a regularization term that penalizes large coefficients. Varying values of size and decay were tested during model training, covering all pairwise combinations of decay = 0.001, 0.01, 0.1 and size = integers from 1 to 10. Highly correlated predictors were filtered out before fitting, since neural networks are sensitive to collinear inputs.
The selected tuning parameters were size = 9, decay = 0.01, and bag = FALSE.
# remove high correlated variables
tooHigh <- findCorrelation(cor(trainX), cutoff=.75)
trainXnnet <- trainX[,-tooHigh]
testXnnet <- testX[,-tooHigh]
# Grid of hyperparameters
nnetGrid <- expand.grid(.decay=c(0.001, 0.01,.1),
.size=c(1:10),
.bag=F)
# run model
nnet_mod <- train(x=trainXnnet, y=trainY$PH,
method="avNNet",
tuneGrid=nnetGrid,
trControl=ctrl,
linout=T,
trace=F,
MaxNWts = 10 * (ncol(training_cleaned)+1) + 10 +1,
maxit=500
)
# plot
nnet_mod %>%
ggplot() +
ggtitle("Results of nnet cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
nnet_pred <- predict(nnet_mod, newdata = testXnnet) %>% as.data.frame()
nnet_acc <- postResample(nnet_pred, obs=testY)
Multivariate adaptive regression splines (MARS) is a non-parametric algorithm that builds a piecewise-linear model to capture nonlinearities and interaction effects. A MARS model was fit by varying the hyperparameters degree (the maximum interaction level) and nprune (the maximum number of terms in the final model). The final values used for the model were nprune = 21 and degree = 3.
# create MARS grid for hyperparameters
marsGrid <- expand.grid(.degree=1:3, .nprune=4:25)
# train model
mars_mod <- train(x=trainX, y=trainY$PH,
method="earth",
tuneGrid = marsGrid,
trControl=ctrl
)
#plot
mars_mod %>%
ggplot() +
ggtitle("Results of MARS cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
mars_pred <- predict(mars_mod, newdata=testX) %>% data.frame()
mars_acc <- postResample(mars_pred, obs=testY)
The final model built was a random forest. Cross-validation was used to optimize mtry, the number of predictors considered as split candidates at each node. The final value selected for the model was mtry = 11.
set.seed(1984)
rf_mod <- train(x=trainX, y=trainY$PH,
method="rf",
tuneLength = 10,
trControl = ctrl)
#plot
rf_mod %>%
ggplot() +
ggtitle("Results of random forest cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
rf_pred <- predict(rf_mod, newdata=testX) %>% data.frame()
rf_acc <- postResample(rf_pred, obs=testY)
The six candidate models were then evaluated based on accuracy metrics computed on the test set, data unseen by the models and therefore indicative of predictive power. The accuracy metrics evaluated, RMSE, R-squared, and MAE, are provided in tabular and graphical form.
Table of Results
table <- data.frame("PLS" = pls_acc %>% round(3),
"KNN" = knn_acc %>% round(3),
"SVM" = svm_acc %>% round(3) ,
"nNet" = nnet_acc %>% round(3),
"MARS" = mars_acc %>% round(3),
"RF" = rf_acc %>% round(3)) %>%
t %>%
data.frame
table$tuning_parameter <- c("ncomp", "k", "sigma, C", "size, decay", "nprune, degree", "mtry")
table$value <- c("7", "7", "0.1, 2", "9, 0.01", "21, 3", "11")
table
Barplot of results
table$model <- rownames(table)
table %>%
select(RMSE, Rsquared, MAE,model) %>%
pivot_longer(c(RMSE,Rsquared,MAE)) %>%
arrange(value) %>%
ggplot(aes(x=model,y=value))+
geom_bar(stat="identity", col="black",fill="dodgerblue", alpha=.5)+
geom_text(aes(label=value),vjust=1.5, color="white")+
facet_wrap(~name, nrow=3, scales = "free_y") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
)
Based on model performance on the test set, the random forest model was selected as the best predictor of pH in the bottling process. Predictions made with the tuned random forest model are provided in the accompanying Excel-readable file. Finally, variable-importance calculations for the random forest model are provided so that future decisions regarding bottling procedures can be targeted in ways that effectively manage pH levels.
# predictions
#predictions <- predict(rf_mod, testX)
# cbind(predictions,testing) %>%
# write.csv('predictions.csv')
varImp(rf_mod)%>%
plot
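For completeness, the sketch below shows one way the evaluation-set predictions delivered in the Excel-readable file could be generated, mirroring the commented-out code above: the StudentEvaluation data is passed through the same preprocessing recipe and centering/scaling, then scored with the tuned random forest. The preproc object and the output file name are assumptions rather than parts of the original code.
# Sketch: score the evaluation set with the selected random forest model
# (assumes the datatrans recipe and the preproc centering/scaling object
# from the setup sketch earlier).
eval_cleaned <- bake(datatrans, new_data = testing)
evalX <- predict(preproc, select(eval_cleaned, -PH))
testing %>%
  mutate(PH = predict(rf_mod, newdata = evalX)) %>%
  write.csv("predictions.csv", row.names = FALSE)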