knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
In this report, we build a machine learning model that predicts the PH level of a drink from a large set of manufacturing predictors. The independent and dependent variables used to fit this model come from data collected on 2,571 different drink samples. The analysis detailed in this report covers three categories of models, with several models tested from each category:

- Linear regression models: Ordinary Linear Regression, Partial Least Squares, Penalized (Ridge) Regression
- Non-linear regression models: Neural Network, Multivariate Adaptive Regression Splines, Support Vector Machines, K-Nearest Neighbors
- Tree-based models: Single Tree, Model Tree, Bagged Tree, Random Forest, Boosted Tree, Cubist
From these models, a best model was selected based on performance across several metrics. All of the models were evaluated on a test set carved out of the StudentData.xlsx file given to the team. Ultimately, Random Forest was selected as the best model based on its performance on that test set.
beverage_train <- readxl::read_excel(
"StudentData.xlsx"
)
beverage_test <- readxl::read_excel(
"StudentEvaluation.xlsx"
)
summary(beverage_train)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
The summary above shows us that Brand Code is a categorical variable. Therefore, we converted it into a factor so that R recognizes it as categorical.
beverage_train <- beverage_train %>%
mutate(`Brand Code` = as.factor(`Brand Code`))
beverage_test <- beverage_test %>%
mutate(`Brand Code` = as.factor(`Brand Code`))
The following code chunk produces a histogram for all of the
continuous variables in the beverage_train data frame.
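A minimal sketch of such a chunk, assuming the dplyr, tidyr, and ggplot2 packages are loaded:

# Reshape the numeric columns to long format and facet one histogram per variable
beverage_train %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  facet_wrap(~ variable, scales = "free")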
The histograms above reveal that the following variables have a unimodal distribution, either roughly normal or skewed. The skewed variables in this list can be corrected with a transformation:

- Carb Pressure
- Carb Temp
- Fill Ounces
- PC Volume
- Pressure Vacuum
- PH
- PSC
- PSC CO2
- PSC Fill
- Temperature
- MFR
- Oxygen Filler

The remaining variables have a bimodal or multimodal distribution:

- Balling
- Balling Lvl
- Carb Flow
- Carb Pressure1
- Carb Rel
- Carb Volume
- Density
- Fill Pressure
- Hyd Pressure4
- Usage cont
- Pressure Setpoint

Boxplots for each of the continuous variables are shown below.
The output above shows us that the following variables have a significant number of outliers:

- Filler Speed
- Temperature
- MFR
- Oxygen Filler
- Air Pressurer

The correlation plot for the continuous variables is shown below.
corrplot::corrplot(cor(dplyr::select_if(beverage_train, is.numeric), use = "na.or.complete"),
method = 'number',
type = 'lower',
diag = FALSE,
number.cex = 0.75)
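The highly correlated pairs discussed next were read off this plot; they could equally be flagged programmatically. A minimal sketch using caret::findCorrelation, assuming the caret package is loaded (note that findCorrelation returns the predictors suggested for removal rather than the pairs themselves):

# Flag predictors involved in absolute pairwise correlations above 0.75
num_cors <- cor(dplyr::select_if(beverage_train, is.numeric), use = "na.or.complete")
findCorrelation(num_cors, cutoff = 0.75, names = TRUE)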
The correlation plot above reveals that some of the predictors are highly correlated with one another. On page 163 of the Applied Predictive Modeling textbook, it is recommended that the maximum absolute pairwise correlation between predictors be kept below 0.75. The following predictor pairs have an absolute correlation greater than 0.75:

- Carb Pressure and Carb Temp
- Mnf Flow and Hyd Pressure3
- Hyd Pressure2 and Hyd Pressure3
- Density and Carb Volume
- MFR and Filler Speed
- Balling and Carb Volume
- Density and Balling
- Filler Level and Bowl Setpoint
- Pressure Setpoint and Fill Pressure
- Alch Rel and Carb Volume
- Density and Alch Rel
- Balling and Alch Rel
- Carb Volume and Carb Rel
- Density and Carb Rel
- Balling and Carb Rel
- Alch Rel and Carb Rel
- Carb Volume and Balling Lvl
- Density and Balling Lvl
- Balling and Balling Lvl
- Alch Rel and Balling Lvl
- Carb Rel and Balling Lvl

As can be seen in the figure below, some of the columns have missing values. These missing values were imputed using the MICE algorithm. The methodology that was used is explained in the “Dealing with Missing Values” section.
In general, imputation by the mean or median is acceptable if missing values account for only 5% of the sample (Peng et al., 2006). However, should the degree of missingness exceed 20%, these simple imputation approaches artificially reduce variability, because every imputed value lands at the center of the variable’s distribution.
Our team decided to employ another technique to handle the missing values: Multivariate Imputation by Chained Equations, via the MICE package.
The MICE package in R implements a methodology where each incomplete variable is imputed by a separate model. Alice points out that plausible values are drawn from a distribution specifically designed for each missing datapoint. Many imputation methods can be used within the package. The one that was selected for the data being analyzed in this report is PMM (Predictive Mean Matching), which is used for quantitative data.
Van Buuren explains that PMM works by selecting values from the observed data that would most plausibly belong to the observation with the missing value. The advantage is that every imputed value is one that actually occurs in the observed data, so impossible values (for example, negative values for a strictly positive measurement) cannot be imputed. PMM also circumvents the shrinkage of errors that a single regression model produces: the variability between the different imputed values gives a wider, but more honest, standard error. Uncertainty is inherent in imputation, which is why having multiple imputed values is important. Furthermore, Marshall et al. (2010) point out:

“Another simulation study that addressed skewed data concluded that predictive mean matching ‘may be the preferred approach provided that less than 50% of the cases have missing data…’”
In order to get the mice algorithm to work, there should
be no spaces in the names. The following code chunk replaces all of the
spaces in the column names for the training and testing data with
underscores.
names(beverage_train) <- gsub(" ", "_", names(beverage_train))
names(beverage_test) <- gsub(" ", "_", names(beverage_test))
The following code chunks use predictive mean matching to impute the missing values.
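Those chunks follow the pattern sketched below, assuming the mice and dplyr packages are loaded (predictive mean matching, pmm, is mice's default method for numeric columns):

# Impute the training data and extract a completed copy
mice_train <- mice(beverage_train, m = 5, seed = 100, printFlag = FALSE)
beverage_train <- mice::complete(mice_train)

# Impute the evaluation data the same way, leaving out its empty PH column
test_predictors <- beverage_test %>% select(-PH)
mice_test <- mice(test_predictors, m = 5, seed = 100, printFlag = FALSE)
beverage_test[names(test_predictors)] <- mice::complete(mice_test)

# Diagnostic plot: observed (blue) vs. imputed (red) distributions
densityplot(mice_train)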
The blue lines in each of the graphs in the figure above represent the distributions of the non-missing data for each variable, while the red lines represent the distributions of the imputed data. Note that the distribution of the imputed data in each iteration closely matches the distribution of the non-missing data, which is ideal. If the distributions did not match well, then another imputation method would have had to be used.
## Remove Predictors With Low Frequencies
Occasionally, datasets will contain predictors where there are very
few unique values relative to the total number of observations. Imagine
for example that for 1000 observations, 999 of them are a single value.
Such a variable should be omitted. The nearZeroVar function
in the caret package can be used in order to filter out
such predictors.
beverage_train <- beverage_train[, -nearZeroVar(beverage_train)]
beverage_test <- beverage_test[, names(beverage_train)]
In the code chunk above, nearZeroVar(beverage_train)
identified the 13th predictor as one with near zero variance, which
corresponds to Hyd Pressure1. Therefore, this variable was
omitted from the beverage_train and
beverage_test dataframes.
Page 54 in the Applied Predictive Modeling textbook explains that:
“To administer a series of transformations to multiple data sets, the
caret class preProcess has the ability to transform,
center, scale, or impute values, as well as apply the spatial sign
transformation and feature extraction. The function calculates the
required quantities for the transformation. After calling the
preProcess function, the predict method
applies the results to a set of data.”
In the train function in the caret package,
there is a preProcess parameter. The train
function is what will be used to create the models in the following
sections. This preProcess parameter will be set to a vector
containing any combination of these 3 strings: YeoJohnson,
center, scale. These three strings request the Yeo-Johnson transformation in addition to centering and scaling of the predictors. We originally planned to use Box-Cox transformations, but the summary of the dataset revealed that some of the predictors contain negative values (Mnf Flow, Hyd Pressure3, etc.), which the Box-Cox transformation cannot handle; Yeo-Johnson extends Box-Cox to zero and negative values. Centering and scaling are generally performed to improve the numerical stability of some calculations. As explained on page 31 of the Applied Predictive Modeling textbook, PLS models benefit from predictors being on a common scale.
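As a concrete illustration of the quoted preProcess-then-predict workflow, the standalone usage looks roughly like this (illustration only; the models below instead pass the same strings to the preProcess argument of caret::train):

# Estimate the Yeo-Johnson, centering, and scaling parameters, then apply them
pp <- preProcess(beverage_train, method = c("YeoJohnson", "center", "scale"))
beverage_train_transformed <- predict(pp, newdata = beverage_train)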
We’re going to use the Pareto Principle to split the data. Wikipedia explains that:

“The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes (the ‘vital few’).”

Therefore, we will be employing an 80/20 split on the beverage_train dataset.
set.seed(1)
# caTools::sample.split preserves the distribution of PH across the two subsets
sample <- sample.split(beverage_train$PH, SplitRatio = 0.8)
train_X <- subset(beverage_train %>% select(-PH), sample == TRUE)
test_X <- subset(beverage_train %>% select(-PH), sample == FALSE)
train_y <- subset(beverage_train$PH, sample == TRUE)
test_y <- subset(beverage_train$PH, sample == FALSE)
# The same 80/20 split is applied to beverage_train_high_corr, which is assumed
# to be defined in an earlier chunk as the copy of the data prepared for the
# PLS model
sample_high_corr <- sample.split(beverage_train_high_corr$PH, SplitRatio = 0.8)
train_X_high_corr <- subset(beverage_train_high_corr %>% select(-PH), sample_high_corr == TRUE)
test_X_high_corr <- subset(beverage_train_high_corr %>% select(-PH), sample_high_corr == FALSE)
train_y_high_corr <- subset(beverage_train_high_corr$PH, sample_high_corr == TRUE)
test_y_high_corr <- subset(beverage_train_high_corr$PH, sample_high_corr == FALSE)
For some of the models, we will be using the trControl parameter in the caret::train function. The reason why is explained on page 130 of the Applied Predictive Modeling textbook:
“The train function generates a resampling estimate of
performance. Because the training set size is not small, 10-fold
cross-validation should produce reasonable estimates of model
performance. The function trainControl specifies the type of
resampling.”
We will use the held-out testing set to compute the test-set RMSE, Rsquared, and MAE, and we will use these metrics to select the best model.
Some of the models will not allow categorical variables to be used,
which means the Brand Code variable cannot be used in these
model types. Converting a categorical variable to a numerical one can be
done, but since the Brand Code variable takes on more than
two different categories, the conversion would not be advantageous. As
explained by Ashkon Farhangi:
“In the vast majority of cases using dummy variables is more statistically sound than using a single numerical variable. A single numerical variable does not accurately encode the information represented by a categorical variable, because of the relationships between numerical values it implicitly implies. On the number line, 1 is closer to 2 than it is to 3. However, by definition no pair of values assumed by categorical variables is more similar than any other. Encoding three such values as 1, 2 and 3 will result in the model inferring incorrect relationships between those numbers. The fact that real valued numbers fall on a number line that implies a greater degree of similarity between values that fall closer to one another along it; this notion is incompatible with categorical variables. Using dummy variables avoids this pitfall.”
Models where the Brand Code predictor was not used (these are the models below that drop the first column via train_X[,-1]) include:

- Penalized Regression (ridge)
- Support Vector Machines
- K-Nearest Neighbors
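Had we instead kept Brand Code for these models, the dummy encoding described above could be generated with caret::dummyVars. A minimal sketch (illustration only, not used in the final analysis):

# Expand the Brand_Code factor into one 0/1 indicator column per level
brand_dummies <- predict(dummyVars(~ Brand_Code, data = train_X), newdata = train_X)
head(brand_dummies)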
ctrl <- trainControl(method = "cv", number = 10)
We set the method to lm in the
train function in the caret package to let
train know that the data must be fit to a linear regression
model.
olr_model <- caret::train(train_X, train_y,
method = "lm",
trControl = ctrl,
preProcess = c("YeoJohnson", "center", "scale"))
olr_pred <- predict(olr_model, newdata = data.frame(test_X))
For this model, the number of components was tuned over the values 1 through 20 via the tuneLength parameter.
set.seed(100)
pls_model <- train(train_X_high_corr,
train_y_high_corr,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"))
pls_pred <- predict(pls_model, newdata = test_X_high_corr)
plot(pls_model)
The plot above shows us that a minimum RMSE was found at around 0.1347. This is approximately at 10 components.
In order to tune over the penalty, a dataframe of different lambdas
from 0 to 0.1 was generated. This dataframe was then used in the
tuneGrid parameter in the train function in
the caret package in order to find the lambda, and in
effect, the model, resulting in the lowest error.
## Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeReg_model <- train(train_X[,-1] %>% as.matrix(),
train_y %>% as.vector(),
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"))
ridgeReg_pred <- predict(ridgeReg_model, newdata = test_X[,-1] %>% as.matrix())
plot(ridgeReg_model)
The plot above shows us that the lowest RMSE happens when the lambda/weight decay is at 0.007142857.
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnet_model <- train(train_X, train_y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 84851,
maxit = 500)
nnet_pred <- predict(nnet_model, newdata = data.frame(test_X))
We can display the variable importance for this model using the
varImp function.
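A minimal sketch of that chunk:

# Plot caret's variable importance scores for the averaged neural network
plot(varImp(nnet_model))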
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
mars_model <- train(train_X,
train_y,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("YeoJohnson"),
trControl = trainControl(method = "cv"))
mars_pred <- predict(mars_model, newdata = test_X)
plot(varImp(mars_model))
According
to the caret documentation, for Multivariate Adaptive
Regression Splines:
“The varImp function tracks the changes in model
statistics, such as the GCV, for each predictor and accumulates the
reduction in the statistic when each predictor’s feature is added to the
model.”
The plot above shows us that Mnf_Flow carries a significant amount of importance for this particular model. However, the importance profile is not as steep as the one in Figure 8.21 of the Applied Predictive Modeling textbook (compare Figure 8.6 to Figure 8.21 there). In other words, the differences in importance between predictors are subtle rather than drastic for this particular model.
set.seed(100)
svmR_model <- train(train_X[,-1] %>% as.matrix(),
train_y %>% as_vector(),
method = "svmRadial",
preProc = c("YeoJohnson", "center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmR_pred <- predict(svmR_model, newdata = test_X[,-1] %>% as.matrix())
svmR_model$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0386575195477038
##
## Number of Support Vectors : 1753
##
## Objective Function Value : -1516.099
## Training error : 0.382666
The subobject named finalModel contains the model as
shown in the above code chunk. The output above shows us that the model
used 1753 training set data points as support vectors (85.1% of the
training set).
The knnreg function in the caret package
fits the KNN regression model. train tunes the model over
k through the tuneGrid parameter, which
creates a grid of values for k from 1 to 20.
set.seed(100)
knn_model <- train(train_X[,-1] %>% as.matrix(),
train_y %>% as_vector(),
method = "knn",
preProc = c("YeoJohnson", "center", "scale"),
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
knn_pred <- predict(knn_model, newdata = test_X[,-1] %>% as.matrix())
According to this Medium article:
“Decision trees and ensemble methods do not require feature scaling to be performed as they are not sensitive to the variance in the data.”
Therefore, the only preprocessing that will be done is performing a
Yeo-Johnson transformation using the preProcess
parameter.
set.seed(100)
single_trees_model <- train(train_X,
train_y,
method = "rpart2",
tuneLength = 20,
preProcess = c("YeoJohnson"),
trControl = trainControl(method = "cv"))
single_trees_pred <- predict(single_trees_model, newdata = test_X)
plot(single_trees_model)
The plot above shows us that the optimal Max Tree Depth
when creating the single tree model for the beverage dataset is around
15.
set.seed(100)
tree_model <- train(
train_X,
train_y,
method = "M5",
trControl = trainControl(method = "cv"),
preProcess = c("YeoJohnson"),
control = Weka_control(M = 10))
tree_pred <- predict(tree_model, newdata = test_X)
plot(tree_model)
The plot above shows the cross-validation profiles for the beverage data. Unpruned trees usually over-fit the training data, which is reflected in the plot above. With smoothing, the error rate is significantly improved. The figure above indicates that the optimal model used pruning and smoothing.
In this model, the ipredbagg function was used to create
the bagged tree model.
set.seed(100)
# ipred::ipredbagg has no preProcess argument (that belongs to caret::train);
# tree-based models do not require scaling, so the predictors are used as-is
bagged_tree_model <- ipredbagg(train_y, train_X)
bagged_tree_pred <- predict(bagged_tree_model, newdata = test_X)
set.seed(100)
# randomForest takes the predictors as-is (it has no preProcess argument);
# importance = TRUE stores the permutation importance plotted further below
rf_model <- randomForest(train_X, train_y,
                         importance = TRUE,
                         ntree = 1000)
rf_pred <- predict(rf_model, newdata = test_X)
plot(rf_model)
The plot above shows that the error decreases sharply then reaches some asymptotic value before the number of trees reaches 50.
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_X,
train_y,
method = "gbm",
preProcess = c("YeoJohnson"),
tuneGrid = gbmGrid,
verbose = FALSE)
gbm_pred <- predict(gbmTune, newdata = test_X)
plot(gbmTune)
The figure above shows us that the larger value of shrinkage reduces the RMSE for all choices of tree depth and number of trees. RMSE also decreases as tree depth increases. The tuning results tell us that the optimal boosted tree has a depth of 7 with 750 trees and a shrinkage of 0.1.
cubistTuned <- train(
train_X,
train_y,
preProcess = c("YeoJohnson"),
trControl = trainControl(method = "cv"),
method = "cubist")
cubist_pred <- predict(cubistTuned, newdata = test_X)
plot(cubistTuned)
In this section, the model used to generate the final test predictions is selected. The selection comes down to three metrics, generated with the postResample function:

- RMSE (root mean squared error)
- Rsquared
- MAE (mean absolute error)

All of these metrics were computed on the test set that was generated when the 80/20 split was performed. Each of these metrics is displayed in the table below.
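The comparison table was assembled along the following lines (a sketch, assuming the prediction vectors computed in the sections above; postResample returns the named vector RMSE, Rsquared, MAE):

# Collect one row of test-set metrics per model (two rows shown for brevity)
tracker <- rbind(
  data.frame(Model = "Random Forest", t(postResample(rf_pred, test_y))),
  data.frame(Model = "Cubist", t(postResample(cubist_pred, test_y)))
  # ... and so on for the remaining models
)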
tracker %>% arrange(RMSE)
## Model RMSE Rsquared MAE
## 1 Random Forest 0.097 0.686 0.071
## 2 Cubist 0.101 0.649 0.073
## 3 Boosted Tree 0.104 0.628 0.077
## 4 Neural Network Model 0.116 0.538 0.086
## 5 Bagged Tree 0.118 0.52 0.092
## 6 Multivariate Adaptive Regression Splines 0.123 0.477 0.093
## 7 Single Tree 0.127 0.442 0.101
## 8 Model Tree 0.129 0.428 0.098
## 9 Support Vector Machines 0.13 0.423 0.094
## 10 K-Nearest Neighbors 0.132 0.394 0.096
## 11 Partial Least Squares 0.134 0.374 0.104
## 12 Ordinary Linear Regression 0.135 0.369 0.105
## 13 Penalized Regression Model 0.143 0.295 0.111
The table above shows that the Random Forest model has the lowest RMSE and the highest Rsquared, making it the best-performing model for this beverage data on the generated test set. Note that, except for the neural network, none of the linear and non-linear regression models achieved an Rsquared above 0.5; only with the tree-based models did the Rsquared values climb well above 0.5. Moreover, the neural network took upwards of 40 minutes to train, versus less than 5 minutes for many of the tree-based models, and it is still not the most accurate model in the table. These factors illustrate the strength of tree-based models on this particular dataset.
randomForest::varImpPlot(rf_model,
main = "Variable Importance for Random Forest Model")
According to DataCamp:
“Mean Decrease Accuracy (%IncMSE) - This shows how much
our model accuracy decreases if we leave out that variable.
Mean Decrease Gini (IncNodePurity) - This is a measure of variable importance based on the Gini impurity index used for calculating the splits in trees.
The higher the value of mean decrease accuracy or mean decrease gini score, the higher the importance of the variable to our model.”
The importance plot on the left shows that all of the variables above
Fill_Ounces would decrease the model accuracy by more than
10% if the variable is left out. It is shown that the
Mnf_Flow predictor plays a very important role in the
accuracy of the model, followed by Brand_Code.
In this section, predictions from the selected Random Forest model
will be generated for the beverage_test dataset and saved
to a .csv.
rf_test_pred_df <- data.frame(predict(rf_model, newdata = beverage_test %>% select(-PH)))
colnames(rf_test_pred_df) <- c("Predictions")
write.csv(rf_test_pred_df, file = "predictions.csv", row.names = FALSE)
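Since the deliverable was requested in an Excel-readable format, the .csv above opens directly in Excel. Alternatively, a native .xlsx file could be written; a one-line sketch, assuming the writexl package is installed:

# Write the predictions to a native Excel workbook
writexl::write_xlsx(rf_test_pred_df, "predictions.xlsx")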
In order to find the best model for this particular dataset, many models had to be generated. Most of these models explained less than half of the variance in PH (Rsquared below 0.5). The tree-based models worked well with this dataset. The Random Forest model in particular not only had the highest Rsquared but also a relatively low runtime, since its trees can be built in parallel. One suggestion the team has for this project is to provide detailed descriptions for each of the variables; this would allow the team to better interpret the importance that each predictor in the random forest model has for the PH level of a drink.