1 Problem Statement

Data related to the manufacturing process of a soft drink are provided, including values of various parameters that control the process. The objective is to build a predictive model to predict the pH content based on the manufacturing process data.

2 Executive Summary

Since predicting the value of a numeric variable such as PH is a regression problem, our methodology was to build five different models and rank their performance on a holdout portion of the training data provided. We then selected the model with the lowest RMSE on the holdout data set. We developed predictors using the following models:

- Support Vector Machine
- Stochastic Gradient Boosting
- Random Forest
- Neural Network
- Generalized Linear Model

Comparing the model prediction errors on the holdout data set, we found that the Random Forest model consistently produced the best results, and we therefore selected it to make predictions on the test data set that was provided.

We also examined the predictors most strongly related to PH. The better we understand the relationships between PH and these variables, the more effectively we can manage PH by manipulating them. The five most important variables are:

- Mnf.Flow,
- Brand.Code_C (via one-hot encoding of Brand.Code),
- Pressure.Vacuum,
- Oxygen.Filler, and
- Temperature.

2.1 Dependencies

Our code has dependencies on the following R packages.

library(caret)
library(caTools)
library(corrplot)
library(e1071)
library(fastDummies)
library(forecast)
library(ggplot2)
library(imputeTS)
library(lattice)
library(knitr)
library(ModelMetrics)
library(nnet)
library(randomForest)
library(readxl)
library(reshape2)
library(tidyr)
library(tidyverse)
library(xlsx)

# Show loaded packages.
(.packages())
##  [1] "xlsx"         "forcats"      "stringr"      "dplyr"       
##  [5] "purrr"        "readr"        "tibble"       "tidyverse"   
##  [9] "tidyr"        "reshape2"     "readxl"       "randomForest"
## [13] "nnet"         "ModelMetrics" "knitr"        "imputeTS"    
## [17] "forecast"     "fastDummies"  "e1071"        "corrplot"    
## [21] "caTools"      "caret"        "ggplot2"      "lattice"     
## [25] "stats"        "graphics"     "grDevices"    "utils"       
## [29] "datasets"     "methods"      "base"

3 Data Initialization and Preprocessing

Below we can see a sample of the data, as read from the Excel files. Our target variable is PH, and every column except the brand code is numeric. The brand code is categorical, but we can easily convert it to numeric indicator columns via one-hot encoding. There are also some missing values that must be imputed.

training <- read_excel("StudentData.xlsx")
test     <- read_excel("StudentEvaluation.xlsx")

# One-hot encode the categorical Brand.Code column; select_columns
# takes the column name, not the column values.
training <- dummy_cols(training, select_columns = "Brand.Code")
test <- dummy_cols(test, select_columns = "Brand.Code")

training <- as.data.frame(sapply(training, as.numeric))
test     <- as.data.frame(sapply(test, as.numeric))

names(training) <- make.names(names(training))
names(test) <- make.names(names(test))

training$Brand.Code <- NULL
test$Brand.Code <- NULL
head(training)

3.1 Imputation of Missing Values

Below we impute the missing values using a monotone cubic interpolant known as Stineman interpolation. It tends to perform well on linear as well as higher-order data vectors.

training <- na.interpolation(training, option = 'stine')
test <- na.interpolation(test, option = 'stine')
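
As a toy illustration (not part of the pipeline), the difference from plain linear interpolation shows up on curved series:

# Fill a gap in a quadratic-like series two ways; the Stineman fill
# follows the curvature, while the linear fill takes a straight line.
x <- c(1, 4, 9, NA, 25, 36)
na.interpolation(x, option = "linear")
na.interpolation(x, option = "stine")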

Below we plot the mean and standard deviation of each data vector. As we can see, our data span many orders of magnitude. For the best fit, the data should be centered and scaled.

means <- sapply(training, mean, na.rm = TRUE)
sds   <- sapply(training, sd, na.rm = TRUE)
explore <- as.data.frame(cbind( means, sds))
ggplot(explore, aes(x = row.names(explore), y = means))+ 
  geom_bar(stat = 'identity') + 
  labs(title = "Means of Various Features") + 
  xlab("Data Features") + 
  ylab("Mean of Data") +
  theme(panel.background = element_blank()) + 
  geom_errorbar(aes(ymin = means - sds, ymax = means + sds))

The histograms below confirm the non-normality of most of our data. For the non-linear models, we must center and scale the predictors; additionally, we will need other data transformations (discussed below).

ggplot(data = gather(training), mapping = aes(x = value)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="lightgrey")+
  facet_wrap(~key, ncol = 1, scales = 'free') 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2 Correlation Plot of Predictors

We can see significant correlation among the predictors. Additionally, many predictors have near-zero variance. Excluding these confounding variables will improve our models.

results <- cor(training, method = 'pearson')
corrplot::corrplot(results, method = 'circle')

3.3 Checking for Missing Values

As we can see, four rows do not have pH data provided. Therefore, we will exclude them from training.

## Response variable missing value 
#table(is.na(training$PH))
## remove rows with missing response variable
training <- training[!is.na(training$PH), ]

## near zero value predictors
nzv <- nearZeroVar(training, saveMetrics= TRUE)
nzv[nzv$nzv,]

3.4 Data Pre-processing

For preprocessing, we transform the predictors using the Yeo-Johnson method and then center and scale the data. We also remove the near-zero-variance predictors and the highly correlated ones identified above, since they will not improve our models. (Missing values were already imputed by Stineman interpolation in Section 3.1.)

x_train <- subset(training, select = -PH )
y_train <- training$PH

x_test <- subset(test, select = -PH )
y_test <- test$PH

preProcValues <- preProcess(x_train, method = c("center", "scale", "YeoJohnson", "nzv", "corr"))

trainTransformed <- predict(preProcValues, x_train)
testTransformed <- predict(preProcValues, x_test)
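
It is worth confirming what the filters removed; a small sketch:

# Print the preProcess summary and list the predictors dropped by the
# near-zero-variance and high-correlation filters.
preProcValues
setdiff(names(x_train), names(trainTransformed))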

4 Support Vector Machine

Comparing SVMs with Neural Networks (NNs), both are non-linear algorithms: a Support Vector Machine with different kernels is comparable to a Neural Network with different layers. One advantage SVMs have over NNs is that NNs need large amounts of data to train, while SVMs work on smaller data sets with less computing power. Furthermore, SVMs usually have only two or three parameters to tune, they are easy to code, and the results are explainable. On the other hand, SVMs might not beat NNs on the accuracy metric.
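
For reference, the radial basis function kernel behind caret's svmRadial method (kernlab's rbfdot) scores the similarity of two observations $x$ and $x'$ as

$$K(x, x') = \exp\left(-\sigma \lVert x - x' \rVert^2\right),$$

so tuning reduces to the kernel width $\sigma$ (held constant by caret's default search, as seen in the output below) and the cost parameter $C$.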

set.seed(100)

y_train.orig = y_train
sample = sample.split(y_train.orig, SplitRatio = .75)
y_train_svm = subset(y_train.orig, sample == TRUE)
y_test_svm = subset(y_train.orig, sample == FALSE)

training_svm = subset(trainTransformed, sample == TRUE)
test_svm = subset(trainTransformed, sample == FALSE)

fitControl <- trainControl(## 3-fold CV
                           method = "repeatedcv",
                           number = 3,
                           ## repeated three times
                           repeats = 3)
svmFit <- train(training_svm, y_train_svm,
                 method = "svmRadial",
                 trControl = fitControl,
                 tuneLength = 8,
                 metric = "RMSE")
svmFit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1931 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 3 times) 
## Summary of sample sizes: 1288, 1287, 1287, 1287, 1288, 1287, ... 
## Resampling results across tuning parameters:
## 
##   C      RMSE       Rsquared   MAE       
##    0.25  0.1292700  0.4519886  0.09627011
##    0.50  0.1262262  0.4746017  0.09335426
##    1.00  0.1237052  0.4933871  0.09105980
##    2.00  0.1215653  0.5095284  0.08960552
##    4.00  0.1208322  0.5159532  0.08930594
##    8.00  0.1218272  0.5114523  0.08998742
##   16.00  0.1247543  0.4960406  0.09217539
##   32.00  0.1304951  0.4666847  0.09638954
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02346075
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02346075 and C = 4.
model1 <- svmFit

rmse.svm.train = rmse(predict(svmFit, training_svm), y_train_svm)
r2.svm.train = R2(predict(svmFit, training_svm), y_train_svm)
rmse.svm.test = rmse(predict(svmFit, test_svm), y_test_svm)
r2.svm.test = R2(predict(svmFit, test_svm), y_test_svm)
kable(data.frame(Model=c("Support Vector Machine"), RMSE.train=c(rmse.svm.train), RSquared.train=c(r2.svm.train), RMSE.test=c(rmse.svm.test), RSquared.test=c(r2.svm.test)))
Model                    RMSE.train  RSquared.train  RMSE.test  RSquared.test
Support Vector Machine    0.0810298       0.7855512    0.11937      0.5163765

5 Stochastic Gradient Boosting

Gradient Boosting Models (GBMs) and Random Forests (RFs) are both tree-based ensemble algorithms; the difference lies in how the ensembles are created. RFs build independent trees, each grown to maximum depth and contributing equally to the final model. GBMs build dependent trees of minimal depth, each contributing differently to the final model. An advantage of GBMs is that they can outperform RFs when tuned properly, and they are good at detecting anomalies. On the other hand, GBMs are more compute-intensive than RFs and more prone to overfitting.
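
Conceptually, each boosting stage $m$ fits a shallow tree $h_m$ to the current residuals and adds it to the ensemble, damped by the shrinkage (learning rate) $\nu$:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$

In the tuning grid below, n.trees is the number of stages $m$, interaction.depth is the depth of each tree $h_m$, and shrinkage is $\nu$.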

y_train_gbm = subset(y_train.orig, sample == TRUE)
y_test_gbm = subset(y_train.orig, sample == FALSE)

training_gbm = subset(trainTransformed, sample == TRUE)
test_gbm = subset(trainTransformed, sample == FALSE)


gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)
                        
nrow(gbmGrid)
## [1] 90
set.seed(123)
gbmFit <- train(training_gbm, y_train_gbm,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "RMSE")
#summary(gbmFit)

rmse.gbm.train = rmse(y_train_gbm, predict(gbmFit, training_gbm))
r2.gbm.train = R2(y_train_gbm, predict(gbmFit, training_gbm))
rmse.gbm.test = rmse(y_test_gbm, predict(gbmFit, test_gbm))
r2.gbm.test = R2(y_test_gbm, predict(gbmFit, test_gbm))
kable(data.frame(Model=c("Stochastic Gradient Boosting"), RMSE.train=c(rmse.gbm.train), RSquared.train=c(r2.gbm.train), RMSE.test=c(rmse.gbm.test), RSquared.test=c(r2.gbm.test)))
Model                         RMSE.train  RSquared.train  RMSE.test  RSquared.test
Stochastic Gradient Boosting   0.0756313       0.8198824  0.1067627      0.6126796

6 Random Forest Model

Random forests are a modification of bagging that builds a large collection of de-correlated trees [1]. They are considered to belong in the category of non-parametric models since the number of parameters grows with the size of the training set. They are considered to be an improvement to the use of CART (Classification and Regression Tree) models because they do not suffer from some of the problems associated with CART models, such as the fact that CART models are unstable: small changes to the structure of the input data can have large effects on the CART model [2]. Random forests are designed to be low-variance estimators.

Random forests are based on the idea of aggregating uncorrelated predictors, since one way to reduce the variance of an estimate is to average several estimates together [2]. A random forest trains each tree on a randomly chosen subset of the data, considering only a randomly chosen subset of the input variables at each split, and aggregates many such trees to produce an overall estimator. Random forests have proven quite successful in a variety of real-world applications and often generalize very well to unseen data.
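
The emphasis on de-correlation follows from a standard result on ensembles: for $B$ identically distributed estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

As $B$ grows the second term vanishes, so the achievable variance reduction is limited by $\rho$; randomizing the variables considered at each split is what drives $\rho$ down.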

# Split the training data into a portion that is withheld from the model and used to evaluate
# the model.

set.seed(123)
sample = sample.split(training$PH, SplitRatio = .75)
training_forest = subset(training, sample == TRUE)
test_forest = subset(training, sample == FALSE)

set.seed(123)
rfFit <- randomForest::randomForest(PH ~ ., data = training_forest, importance = TRUE,
                                    ntree = 100, keep.forest = TRUE)
model2 <- rfFit

varImpPlot(rfFit, n.var=20,
           main="Important Variables in Random Forest Model (top 20 shown)")

We now compute RMSE values for the Random Forest model on both the training and the test (withheld) portion of the data set.

training_forest2 = dplyr::select(training_forest, -PH)
rfPred.train = predict(rfFit, training_forest2)
rmse.rf.train = rmse(training_forest$PH, rfPred.train)
r2.rf.train = R2(training_forest$PH, rfPred.train)
test_forest2 = dplyr::select(test_forest, -PH)
rfPred.test = predict(rfFit, test_forest2)
rmse.rf.test = rmse(test_forest$PH, rfPred.test)
r2.rf.test = R2(test_forest$PH, rfPred.test)
kable(data.frame(Model=c("Random Forest"), RMSE.train=c(rmse.rf.train), RSquared.train=c(r2.rf.train), RMSE.test=c(rmse.rf.test), RSquared.test=c(r2.rf.test)))
Model          RMSE.train  RSquared.train  RMSE.test  RSquared.test
Random Forest   0.0410204       0.9613318  0.0988593      0.6840023

7 Neural Network Model

Neural Networks are a powerful nonlinear technique inspired by theories about how the human brain works [5]. They can act as classifiers (when the output variable is categorical) or as regression models (when the output variable is numeric). Below we build and evaluate a regression artificial neural network (ANN) using the nnet package in R.

set.seed(123)
training_nn = training_forest2
training_nn_ph = training_forest$PH
test_nn = test_forest2
test_nn_ph = test_forest$PH
  
nnetFit <- nnet(training_nn, training_nn_ph,
                size = 4,
                decay = 0.01,
                linout = TRUE,
                trace = FALSE,
                maxit = 500, # Iterations
                ## Upper bound on the number of weights:
                ## size*(inputs+1) + (size+1) = 4*(36+1) + 5 = 153
                MaxNWts= 4 * (ncol(training_nn) + 1) + 5 + 1)
nnetFit
## a 36-4-1 network with 153 weights
## options were - linear output units  decay=0.01
model3 <- nnetFit
rmse.nnet.train = rmse(training_nn_ph, predict(nnetFit, training_nn))
r2.nnet.train = R2(training_nn_ph, predict(nnetFit, training_nn))
rmse.nnet.test = rmse(test_nn_ph, predict(nnetFit, test_nn))
r2.nnet.test = R2(test_nn_ph, predict(nnetFit, test_nn))
kable(data.frame(Model=c("ANN"), RMSE.train=c(rmse.nnet.train), RSquared.train=c(r2.nnet.train), RMSE.test=c(rmse.nnet.test), RSquared.test=c(r2.nnet.test)))
Model  RMSE.train  RSquared.train  RMSE.test  RSquared.test
ANN     0.1152558       0.5558012  0.1253701      0.4689255

8 Generalized Linear Model

Because our output value is continuous and our data are numeric, we can use a generalized linear model to predict the pH. These models are generic and assume a linear relationship between predictors and response. We use the Gaussian family, which assumes normally distributed errors.
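
With the Gaussian family and its identity link, the GLM reduces to ordinary least squares,

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I),$$

and the backward stepwise search below simply prunes predictors from this linear regression by AIC.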

set.seed(123)
training_glm = subset(training, sample == TRUE)
test_glm = subset(training, sample == FALSE)

glm_train_label <- training_glm$PH
glm_test_label <- test_glm$PH
glm_train <- select(training_glm, -PH)
glm_test <- select(test_glm, -PH)

model4 <- glm(glm_train_label ~., glm_train, family = "gaussian")
model5 <- step(model4, direction = "backward", trace = FALSE)

sum4 <- summary(model4)
sum5 <- summary(model5)

plot(model4)

plot(model5)

rmse.glm.train <- rmse(predict(model5, glm_train), glm_train_label)
r2.glm.train <- R2(predict(model5, glm_train), glm_train_label)
rmse.glm.test <- rmse(predict(model5, glm_test), glm_test_label)
r2.glm.test <- R2(predict(model5, glm_test), glm_test_label)
kable(data.frame(Model=c("GLM"), RMSE.train=c(rmse.glm.train), RSquared.train=c(r2.glm.train), RMSE.test=c(rmse.glm.test), RSquared.test=c(r2.glm.test)))
Model  RMSE.train  RSquared.train  RMSE.test  RSquared.test
GLM     0.1308481       0.4274836  0.1319767       0.405455

As we can see from the diagnostic plots, both generalized linear models have randomly distributed residuals, and the fitted values fall fairly close to the actual values, so the assumptions of the model are broadly fulfilled. However, the first quartile is consistently over-predicted, making the GLM inappropriate for our purposes.
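
That over-prediction can be checked numerically; a quick sketch (not part of the original analysis) averages the residuals within quartiles of the fitted values:

# Mean residual within each quartile of fitted values; a clearly
# negative mean in the first quartile indicates over-prediction there.
q <- cut(fitted(model5), quantile(fitted(model5)), include.lowest = TRUE)
tapply(residuals(model5), q, mean)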

9 Comparison of Model Performance

We now compare and rank the RMSE values produced by the various models on the portion of the training data that was withheld from model training. Using the model with the smallest RMSE on the withheld data, we also make a final prediction on the test data provided. The predictions are written to an Excel file.

Since we see that the Random Forest model produced the lowest RMSE on the withheld training data, we select it as the best model to predict PH in the manufacturing process data set. A final set of predictions is made using this model.

modelperf = data.frame(matrix(ncol=4, nrow=10))
colnames(modelperf) = c("Dataset", "Model", "RMSE", "RSquared")


modelperf[1,] = list("Train", "Random Forest", rmse.rf.train, r2.rf.train)
modelperf[2,] = list("Test", "Random Forest", rmse.rf.test, r2.rf.test)
modelperf[3,] = list("Train", "ANN", rmse.nnet.train, r2.nnet.train)
modelperf[4,] = list("Test", "ANN", rmse.nnet.test, r2.nnet.test)
modelperf[5,] = list("Train", "SVM", rmse.svm.train, r2.svm.train)
modelperf[6,] = list("Test", "SVM", rmse.svm.test, r2.svm.test)
modelperf[7,] = list("Train", "GBM", rmse.gbm.train, r2.gbm.train)
modelperf[8,] = list("Test", "GBM", rmse.gbm.test, r2.gbm.test)
modelperf[9,] = list("Train", "GLM", rmse.glm.train, r2.glm.train)
modelperf[10,] = list("Test", "GLM", rmse.glm.test, r2.glm.test)

ggplot(data=modelperf, aes(x=reorder(Model, RMSE), y=RMSE, fill=Dataset)) +
    geom_bar(stat="identity", position=position_dodge()) +
    ggtitle("Model RMSE for PH Prediction")

ggplot(data=modelperf, aes(x=reorder(Model, -RSquared), y=round(RSquared,3), fill=Dataset)) +
    geom_bar(stat="identity", position=position_dodge()) +
    ggtitle("Model RSquared for PH Prediction")

bestFit = model2  # model2 is the Random Forest fit selected above
PH.pred = predict(bestFit, test)
write.xlsx(PH.pred, file="DATA624_Proj2.xlsx", sheetName="PH", append=F)
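
As a quick sanity check on the exported predictions, their distribution should resemble the distribution of PH in the training data:

# Compare the predicted PH distribution against the training PH.
summary(PH.pred)
summary(training$PH)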

10 Conclusions

All of our models performed fairly well, with holdout RMSEs between roughly 0.10 and 0.13 pH points. Random Forest outperformed the other models with the smallest RMSE and largest RSquared values in training (0.0410 and 0.961) and test (0.0989 and 0.684), and is selected as our best model. Because Random Forest is a tree-based model it is hard to obtain coefficients for each variable, but it does quantify how important each variable is in affecting the target variable, as shown in the variable importance plot above. The top five variables are

- Mnf.Flow,
- Brand.Code_C (via one-hot encoding of Brand.Code),
- Pressure.Vacuum,
- Oxygen.Filler, and
- Temperature

By controlling these variables, we can maximize our control over the pH of the product while minimizing the number of parameters we must monitor and adjust. Further cost analysis of these processes would help determine our route forward.

11 Next Step

There are many ways we could continue improving model performance. One would be to use more folds and repetitions of cross-validation than the 3-times-repeated 3-fold scheme currently used for the SVM and GBM models; this takes longer to compute, but the tuning estimates would likely be better. Finally, more data would help us build a better model, particularly because the gap between training and holdout performance is relatively large across all of the models.
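
A minimal sketch of that heavier resampling scheme (object names here are illustrative):

# 10-fold cross-validation repeated 5 times: slower, but yields less
# noisy tuning estimates than the current 3 x 3-fold scheme.
fitControlHeavy <- trainControl(method = "repeatedcv",
                                number = 10,
                                repeats = 5)
# e.g. svmFit2 <- train(training_svm, y_train_svm, method = "svmRadial",
#                       trControl = fitControlHeavy, tuneLength = 8,
#                       metric = "RMSE")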

12 References

  1. Random Forests. https://uc-r.github.io/random_forests
  2. Murphy, Kevin (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  3. Support Vector Machine. https://uc-r.github.io/svm
  4. Support Vector Machines. http://web.mit.edu/6.034/wwwbob/svm.pdf
  5. Kuhn, Max, and Kjell Johnson (2013). Applied Predictive Modeling. Springer.