knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.

Problem Statement and Goals

In this report, we build a machine learning model that predicts the PH level of a drink from a large set of predictors. The independent and dependent variables used to fit this model come from data collected on 2,571 drink samples. The analysis detailed in this report covers several candidate models drawn from three categories:

  • Linear regression models
  • Non-linear regression models
  • Regression trees and rule-based models

From these models, the best one was selected based on its performance across several metrics. All of the models were evaluated on a test set carved out of the StudentData.xlsx file given to the team. Ultimately, the Random Forest model was selected based on its performance on that test set.

Data Exploration

beverage_train <- readxl::read_excel("StudentData.xlsx")
beverage_test <- readxl::read_excel("StudentEvaluation.xlsx")
summary(beverage_train)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1

The summary above shows that Brand Code is a categorical variable, so we converted it into a factor so that R treats it as one.

beverage_train <- beverage_train %>%
  mutate(`Brand Code` = as.factor(`Brand Code`))

beverage_test <- beverage_test %>%
  mutate(`Brand Code` = as.factor(`Brand Code`))

The following code chunk produces a histogram for all of the continuous variables in the beverage_train data frame.
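A minimal sketch of such a chunk, assuming the tidyverse (dplyr, tidyr, ggplot2) is loaded:

library(tidyverse)

# Reshape the numeric columns to long format and draw one histogram per variable.
beverage_train %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")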

The histograms above reveal that the following variables have roughly normal, though in some cases skewed, distributions. The skewness that some of them display can be corrected with a transformation:

  • Carb Pressure
  • Carb Temp
  • Fill Ounces
  • PC Volume
  • Pressure Vacuum
  • PH
  • PSC
  • PSC CO2
  • PSC Fill
  • Temperature
  • MFR
  • Oxygen Filler

The remaining variables have bimodal or multimodal distributions:

  • Balling
  • Balling Lvl
  • Carb Flow
  • Carb Pressure1
  • Carb Rel
  • Carb Volume
  • Density
  • Fill Pressure
  • Hyd Pressure4
  • Usage cont
  • Pressure Setpoint

Boxplots for each of the continuous variables are shown below.
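A sketch of the corresponding chunk, under the same assumptions as the histogram sketch above:

# One boxplot per numeric column, again via long format and faceting.
beverage_train %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free")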

The output above shows us that the following variables have a significant number of outliers:

  • Filler Speed
  • Temperature
  • MFR
  • Oxygen Filler
  • Air Pressurer

The correlation plot for the continuous variables is shown below.

corrplot::corrplot(cor(dplyr::select_if(beverage_train, is.numeric), use = "na.or.complete"),
         method = 'number',
         type = 'lower',
         diag = FALSE,
         number.cex = 0.75)

The correlation plot above reveals that some of the predictors are very highly correlated. Page 163 of the Applied Predictive Modeling textbook recommends keeping the maximum absolute pairwise correlation between predictors below 0.75. The following predictor pairs exceed that threshold:

  • Carb Pressure and Carb Temp
  • Mnf Flow and Hyd Pressure3
  • Hyd Pressure2 and Hyd Pressure3
  • Density and Carb Volume
  • MFR and Filler Speed
  • Balling and Carb Volume
  • Density and Balling
  • Filler Level and Bowl Setpoint
  • Pressure Setpoint and Fill Pressure
  • Alch Rel and Carb Volume
  • Density and Alch Rel
  • Balling and Alch Rel
  • Carb Volume and Carb Rel
  • Density and Carb Rel
  • Balling and Carb Rel
  • Alch Rel and Carb Rel
  • Carb Volume and Balling Lvl
  • Density and Balling Lvl
  • Balling and Balling Lvl
  • Alch Rel and Balling Lvl
  • Carb Rel and Balling Lvl

NA Exploration

As can be seen in the figure below, some of the columns have missing values. These missing values were imputed using the MICE algorithm. The methodology that was used is explained in the “Dealing with Missing Values” section.
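A minimal sketch of one way to produce such a figure, using base R graphics:

# Count the missing values in each column and plot the counts, largest first.
na_counts <- colSums(is.na(beverage_train))
barplot(sort(na_counts, decreasing = TRUE), las = 2, cex.names = 0.6,
        main = "Missing Values per Column")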

Data Preparation

Dealing with Missing Values

In general, imputation by means/medians is acceptable if missing values account for only about 5% of the sample (Peng et al., 2006). However, should the proportion of missing values exceed 20%, these simple imputation approaches artificially reduce variability, because every imputed value sits at the center of the variable’s distribution.

Our team decided to employ another technique to handle the missing values: Multivariate Imputation by Chained Equations, via the MICE package.

The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing data point. Many imputation methods are available within the package; the one selected for this analysis is PMM (Predictive Mean Matching), which is suited to quantitative data.

Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the observation with the missing value. An advantage of this approach is that every imputed value actually occurs in the observed data, so, for example, no negative values will be imputed for a strictly positive variable. PMM also circumvents the shrinking of standard errors that single imputation causes: the variability between the different imputed values yields a wider, but more honest, standard error. Uncertainty is inherent in imputation, which is why having multiple imputed values is important. Marshall et al. (2010) point out that:

“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”

For the mice algorithm to work, the column names should contain no spaces. The following code chunk replaces all of the spaces in the column names of the training and testing data with underscores.

names(beverage_train) <- gsub(" ", "_", names(beverage_train))
names(beverage_test) <- gsub(" ", "_", names(beverage_test))

The following code chunks use predictive mean matching to impute the missing values.
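A minimal sketch of those chunks, assuming m = 5 imputed datasets, predictive mean matching for every incomplete column, and mice defaults otherwise:

library(mice)

set.seed(100)
# Impute each incomplete column with predictive mean matching (PMM);
# an analogous call is made for beverage_test.
imputed <- mice(beverage_train, m = 5, method = "pmm", printFlag = FALSE)

# densityplot() overlays the imputed values (red) on the observed ones (blue),
# producing the diagnostic figure discussed below.
densityplot(imputed)

# Keep the first completed dataset.
beverage_train <- complete(imputed)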

The blue lines in each panel of the figure above represent the distributions of the non-missing data for each variable, while the red lines represent the distributions of the imputed data. Note that the distribution of the imputed data in each iteration closely matches the distribution of the non-missing data, which is ideal. If the distributions had not matched so well, then another imputation method would have been needed.

Removing Predictors With Low Frequencies

Occasionally, datasets contain predictors with very few unique values relative to the total number of observations. Imagine, for example, that of 1,000 observations, 999 share a single value. Such a variable should be omitted. The nearZeroVar function in the caret package can be used to filter out such predictors.

beverage_train <- beverage_train[, -nearZeroVar(beverage_train)]
beverage_test <- beverage_test[, names(beverage_train)]

In the code chunk above, nearZeroVar(beverage_train) identified the 13th predictor, Hyd_Pressure1, as having near-zero variance. This variable was therefore omitted from the beverage_train and beverage_test dataframes.

Removing Highly Correlated Predictors

We will use the methodology outlined on page 163 of the Applied Predictive Modeling textbook by Kuhn and Johnson to ensure that the absolute pairwise correlation between the predictors is less than 0.75. For the beverage_test set that will be used to generate the final predictions, only the predictors remaining in the beverage_train dataset after this filtering will be used.

Before removing the highly correlated predictors, we must be cognizant of the fact that PLS “allows you to reduce the dimensionality of correlated variables and model the underlying, shared, information of those variables (in both dependent and independent variables)” (source). Therefore, we retain all of the remaining variables for the PLS model generated in the “Partial Least Squares” section.

# Compute correlations on the numeric predictors only.
predictor_df <- beverage_train %>% select(-c(Brand_Code, PH))
tooHigh <- findCorrelation(cor(predictor_df), cutoff = 0.75)
# Convert the matrix indices to column names: indexing the full dataframes
# directly with tooHigh would drop the wrong columns, since Brand_Code and PH
# shift the column positions.
drop_cols <- names(predictor_df)[tooHigh]

beverage_train_high_corr <- beverage_train
beverage_test_high_corr <- beverage_test

beverage_train <- beverage_train %>% select(-all_of(drop_cols))
beverage_test <- beverage_test %>% select(-all_of(drop_cols))

Yeo-Johnson Transformations, Centering, Scaling

Page 54 in the Applied Predictive Modeling textbook explains that:

“To administer a series of transformations to multiple data sets, the caret class preProcess has the ability to transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction. The function calculates the required quantities for the transformation. After calling the preProcess function, the predict method applies the results to a set of data.”

The train function in the caret package, which will be used to create the models in the following sections, has a preProcess parameter. This parameter will be set to a vector containing some combination of the strings YeoJohnson, center, and scale, which apply the Yeo-Johnson transformation, centering, and scaling to the predictors, respectively. We originally intended to perform Box-Cox transformations, but the summary of the dataset revealed that some of the predictors contain negative values (Mnf Flow, Hyd Pressure3, etc.), and Box-Cox requires strictly positive values; the Yeo-Johnson transformation accommodates zero and negative values. Centering and scaling are generally performed to improve the numerical stability of some calculations. As explained on page 31 of the Applied Predictive Modeling textbook, PLS models benefit from predictors being on a common scale.
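For reference, a sketch of the standalone preProcess/predict pattern described in the quote (in the sections below we instead pass preProcess directly to train; beverage_train_pp is a hypothetical name used only for illustration):

# Estimate the transformation parameters on the predictors
# (preProcess leaves non-numeric columns such as Brand_Code untouched)...
pp <- preProcess(beverage_train %>% select(-PH),
                 method = c("YeoJohnson", "center", "scale"))
# ...then apply them with predict().
beverage_train_pp <- predict(pp, beverage_train %>% select(-PH))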

Data Splitting

We’re going to use the Pareto Principle to split the data. Wikipedia explains that:

“The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes (the ‘vital few’).”

Therefore, we will be employing an 80/20 split to the beverage_train dataset.

set.seed(1)
# sample.split() comes from the caTools package.
sample <- sample.split(beverage_train$PH, SplitRatio = 0.8)

train_X  <- subset(beverage_train %>% select(-PH), sample == TRUE)
test_X   <- subset(beverage_train %>% select(-PH), sample == FALSE)
train_y <- subset(beverage_train$PH, sample == TRUE)
test_y <- subset(beverage_train$PH, sample == FALSE)

# The same split vector `sample` is reused below so the full-predictor
# (high-correlation) datasets stay row-aligned with the filtered data.

train_X_high_corr <- subset(beverage_train_high_corr %>% select(-PH), sample == TRUE)
test_X_high_corr <- subset(beverage_train_high_corr %>% select(-PH), sample == FALSE)
train_y_high_corr <- subset(beverage_train_high_corr$PH, sample == TRUE)
test_y_high_corr <- subset(beverage_train_high_corr$PH, sample == FALSE)

General Modeling

For some of the models, we will be using the trControl parameter of the caret::train function. The reason why is explained on page 130 of the Applied Predictive Modeling textbook:

“The train function generates a resampling estimate of performance. Because the training set size is not small, 10-fold cross-validation should produce reasonable estimates of model performance. The function trainControl specifies the type of resampling.”

We will use the held-out testing set to compute the test-set RMSE, Rsquared, and MAE, and these metrics will drive the selection of the best model.

Some of the models do not accept categorical variables, which means the Brand Code variable cannot be used in those model types. Converting a categorical variable to a numerical one is possible, but since Brand Code takes on more than two categories, the conversion would not be advantageous. As explained by Ashkon Farhangi:

“In the vast majority of cases using dummy variables is more statistically sound than using a single numerical variable. A single numerical variable does not accurately encode the information represented by a categorical variable, because of the relationships between numerical values it implicitly implies. On the number line, 1 is closer to 2 than it is to 3. However, by definition no pair of values assumed by categorical variables is more similar than any other. Encoding three such values as 1, 2 and 3 will result in the model inferring incorrect relationships between those numbers. The fact that real valued numbers fall on a number line that implies a greater degree of similarity between values that fall closer to one another along it; this notion is incompatible with categorical variables. Using dummy variables avoids this pitfall.”
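As an illustration of the dummy-variable encoding described in the quote (a sketch only; for the models listed below, Brand Code was dropped rather than encoded):

# caret's dummyVars expands the factor into 0/1 indicator columns,
# avoiding the false ordering a single numeric code would imply.
dummies <- dummyVars(~ Brand_Code, data = beverage_train)
head(predict(dummies, newdata = beverage_train))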

Models where the Brand Code predictor was not used include:

  • Penalized Regression
  • Support Vector Machines
  • K-Nearest Neighbors

ctrl <- trainControl(method = "cv", number = 10)

Linear Regression Modeling

Ordinary Linear Regression

We set method = "lm" in the caret train function so that train fits an ordinary linear regression model to the data.

olr_model <- caret::train(train_X, train_y, 
                          method = "lm", 
                          trControl = ctrl,
                          preProcess = c("YeoJohnson", "center", "scale"))

olr_pred <- predict(olr_model, newdata = data.frame(test_X))

Partial Least Squares

For this model, we tuned over the number of PLS components, allowing up to 20 via the tuneLength parameter.

set.seed(100)

pls_model <- train(train_X_high_corr, 
                   train_y_high_corr,
                   method = "pls",
                   tuneLength = 20,
                   trControl = ctrl,
                   preProc = c("YeoJohnson", "center", "scale"))

pls_pred <- predict(pls_model, newdata = test_X_high_corr)

plot(pls_model)

The plot above shows that the minimum RMSE, roughly 0.1347, occurs at approximately 10 components.

Penalized Regression Model

To tune over the penalty, a dataframe of lambda values ranging from 0 to 0.1 was generated. This dataframe was then passed to the tuneGrid parameter of the caret train function to find the lambda, and thereby the model, with the lowest cross-validated error.

## Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))

set.seed(100)

ridgeReg_model <- train(train_X[,-1] %>% as.matrix(),
                        train_y %>% as.vector(),
                        method = "ridge",
                        tuneGrid = ridgeGrid,
                        trControl = ctrl,
                        preProc = c("YeoJohnson", "center", "scale"))

ridgeReg_pred <- predict(ridgeReg_model, newdata = test_X[,-1] %>% as.matrix())

plot(ridgeReg_model)

The plot above shows that the lowest RMSE occurs at a lambda (weight decay) of 0.007142857.

Non-Linear Regression Modeling

Neural Network Model

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)

set.seed(100)

nnet_model <- train(train_X, train_y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("YeoJohnson", "center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 84851,
                  maxit = 500)

nnet_pred <- predict(nnet_model, newdata = data.frame(test_X))

Multivariate Adaptive Regression Splines

We can display the variable importance for this model using the varImp function.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

mars_model <- train(train_X,
                    train_y,
                    method = "earth",
                    tuneGrid = marsGrid,
                    preProcess = c("YeoJohnson"),
                    trControl = trainControl(method = "cv"))

mars_pred <- predict(mars_model, newdata = test_X)

plot(varImp(mars_model))

According to the caret documentation, for Multivariate Adaptive Regression Splines:

“The varImp function tracks the changes in model statistics, such as the GCV, for each predictor and accumulates the reduction in the statistic when each predictor’s feature is added to the model.”

The plot above shows that Mnf_Flow carries a significant amount of importance for this particular model. The importance profile, however, is not as steep as the one in Figure 8.21 of the Applied Predictive Modeling textbook (compare Figure 8.6 to Figure 8.21 there): the differences in importance between predictors are subtle rather than drastic for this model.

Support Vector Machines

set.seed(100)

svmR_model <- train(train_X[,-1] %>% as.matrix(),
                    train_y %>% as_vector(),
                    method = "svmRadial",
                    preProc = c("YeoJohnson", "center", "scale"),
                    tuneLength = 14,
                    trControl = trainControl(method = "cv"))

svmR_pred <- predict(svmR_model, newdata = test_X[,-1] %>% as.matrix())

svmR_model$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0386575195477038 
## 
## Number of Support Vectors : 1753 
## 
## Objective Function Value : -1516.099 
## Training error : 0.382666

The subobject named finalModel contains the model as shown in the above code chunk. The output above shows us that the model used 1753 training set data points as support vectors (85.1% of the training set).

K-Nearest Neighbors

Behind method = "knn", caret's knnreg function fits the KNN regression model. train tunes over k via the tuneGrid parameter, which here supplies candidate values of k from 1 to 20.

set.seed(100)

knn_model <- train(train_X[,-1] %>% as.matrix(),
                   train_y %>% as_vector(),
                   method = "knn",
                   preProc = c("YeoJohnson", "center", "scale"),
                   tuneGrid = data.frame(.k = 1:20),
                   trControl = trainControl(method = "cv"))

knn_pred <- predict(knn_model, newdata = test_X[,-1] %>% as.matrix())

Regression Trees and Rule-Based Models

According to this Medium article:

“Decision trees and ensemble methods do not require feature scaling to be performed as they are not sensitive to the variance in the data.”

Therefore, the only preprocessing that will be done is performing a Yeo-Johnson transformation using the preProcess parameter.

Single Trees Model

set.seed(100)

single_trees_model <- train(train_X, 
                            train_y,
                            method = "rpart2",
                            tuneLength = 20,
                            preProcess = c("YeoJohnson"),
                            trControl = trainControl(method = "cv"))

single_trees_pred <- predict(single_trees_model, newdata = test_X)

plot(single_trees_model)

The plot above shows that the optimal maximum tree depth for the single-tree model on the beverage dataset is around 15.

Model Tree

set.seed(100)

tree_model <- train(
                train_X, 
                train_y,
                method = "M5",
                trControl = trainControl(method = "cv"),
                preProcess = c("YeoJohnson"),
                control = Weka_control(M = 10))

tree_pred <- predict(tree_model, newdata = test_X)

plot(tree_model)

The plot above shows the cross-validation profiles for the beverage data. Unpruned trees usually over-fit the training data, and that is reflected in the plot; with smoothing, the error rate improves significantly. The figure indicates that the optimal model used both pruning and smoothing.

Bagged Tree Model

In this model, the ipredbagg function was used to create the bagged tree model.

set.seed(100)

# Note: ipredbagg() is from the ipred package, not caret, so it has no
# preProcess argument; trees are insensitive to monotone transformations,
# so the raw predictors are used directly.
bagged_tree_model <- ipredbagg(train_y, train_X)

bagged_tree_pred <- predict(bagged_tree_model, newdata = test_X)

Random Forest Model

# Note: randomForest() has no preProcess argument, and the forest-size
# argument is ntree (a misspelled ntrees would be silently ignored).
rf_model <- randomForest(train_X, train_y,
                         importance = TRUE,
                         ntree = 1000)

rf_pred <- predict(rf_model, newdata = test_X)

plot(rf_model)

The plot above shows that the error decreases sharply then reaches some asymptotic value before the number of trees reaches 50.

Boosted Tree Model

gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)

set.seed(100)

gbmTune <- train(train_X, 
                 train_y,
                 method = "gbm",
                 preProcess = c("YeoJohnson"),
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbm_pred <- predict(gbmTune, newdata = test_X)

plot(gbmTune)

The figure above shows that the larger value of shrinkage reduces the RMSE for all choices of tree depth and number of trees, and that RMSE decreases as tree depth increases. The output also tells us that the optimal boosted tree has a depth of 7, 750 trees, and a shrinkage of 0.1.

Cubist

cubistTuned <- train(
  train_X, 
  train_y,
  preProcess = c("YeoJohnson"),
  trControl = trainControl(method = "cv"),
  method = "cubist")

cubist_pred <- predict(cubistTuned, newdata = test_X)

plot(cubistTuned)

Model Selection

In this section, we select the model that will generate the final test predictions. The selection comes down to three metrics, computed with the postResample function (formal definitions follow the list):

  • MAE
    • Mean absolute error: the average of the absolute differences between the predicted and observed values. Every error is weighted equally, so occasional large errors are diluted by many small ones.
  • RMSE
    • Root mean squared error: the errors are squared before averaging, and the square root of that average is taken. Squaring dampens the influence of small errors while amplifying the effect of large ones.
  • Rsquared
    • Represents how much of the variation of the dependent variable is explained by the independent variables in a regression model.
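Formally, with observed values $y_i$, predictions $\hat{y}_i$, and $n$ test observations:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$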

All of these metrics were computed on the test set generated by the 80/20 split. They are displayed in the table below.
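A sketch of how the tracker table can be assembled with postResample; two rows are shown, and the remaining models follow the same pattern:

# postResample() returns a named vector of RMSE, Rsquared, and MAE.
tracker <- rbind(
  data.frame(Model = "Random Forest", t(postResample(pred = rf_pred, obs = test_y))),
  data.frame(Model = "Cubist", t(postResample(pred = cubist_pred, obs = test_y)))
)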

tracker %>% arrange(RMSE)
##                                       Model  RMSE Rsquared   MAE
## 1                             Random Forest 0.097    0.686 0.071
## 2                                    Cubist 0.101    0.649 0.073
## 3                              Boosted Tree 0.104    0.628 0.077
## 4                      Neural Network Model 0.116    0.538 0.086
## 5                               Bagged Tree 0.118     0.52 0.092
## 6  Multivariate Adaptive Regression Splines 0.123    0.477 0.093
## 7                               Single Tree 0.127    0.442 0.101
## 8                                Model Tree 0.129    0.428 0.098
## 9                   Support Vector Machines  0.13    0.423 0.094
## 10                      K-Nearest Neighbors 0.132    0.394 0.096
## 11                    Partial Least Squares 0.134    0.374 0.104
## 12               Ordinary Linear Regression 0.135    0.369 0.105
## 13               Penalized Regression Model 0.143    0.295 0.111

The table above shows that the Random Forest model has the lowest RMSE and the highest Rsquared, making it the best-performing model for this beverage data on the held-out test set. Note that among the linear and non-linear regression models, only the neural network achieved an Rsquared above 0.5; it took the tree-based models for the Rsquared values to climb consistently above 0.5. The neural network, moreover, took upwards of 40 minutes to train, versus less than 5 minutes for many of the tree-based models, and it is still not the most accurate model in the table. These factors illustrate the strength of tree-based models on this particular dataset.

randomForest::varImpPlot(rf_model,
                         main = "Variable Importance for Random Forest Model")

According to DataCamp:

“Mean Decrease Accuracy (%IncMSE) - This shows how much our model accuracy decreases if we leave out that variable.

Mean Decrease Gini (IncNodePurity) - This is a measure of variable importance based on the Gini impurity index used for the calculating the splits in trees.

The higher the value of mean decrease accuracy or mean decrease gini score, the higher the importance of the variable to our model.”

The importance plot on the left shows that leaving out any of the variables ranked above Fill_Ounces would decrease the model accuracy by more than 10%. The Mnf_Flow predictor plays the most important role in the accuracy of the model, followed by Brand_Code.

Generating Predictions

In this section, predictions from the selected Random Forest model are generated for the beverage_test dataset and saved to a .csv file, which Excel can open directly.

rf_test_pred_df <- data.frame(predict(rf_model, newdata = beverage_test %>% select(-PH)))
colnames(rf_test_pred_df) <- c("Predictions")

write.csv(rf_test_pred_df, file = "predictions.csv", row.names = FALSE)
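If a native Excel workbook is preferred, the same dataframe could also be exported as .xlsx (a sketch, assuming the writexl package is installed):

# Optional .xlsx export of the same predictions.
writexl::write_xlsx(rf_test_pred_df, "predictions.xlsx")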

Conclusion

To find the best model for this particular dataset, many models had to be generated. Most of them explained less than half of the variance in PH, with Rsquared values below 0.5. The tree-based models worked well with this dataset; the Random Forest model in particular not only had the highest Rsquared but also a relatively low runtime, since tree building can be parallelized. One suggestion the team has for this project is to provide detailed descriptions of each variable. This would allow the team to better interpret the importance that each predictor in the random forest model has for the PH level of a drink.