Introduction

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should clearly show the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' predictions of PH.

library(tidyverse)
library(fpp3)
library(caret)
library(RANN)
library(psych)
library(DataExplorer)
library(randomForest)
library(Cubist)
library(rpart)
library(writexl)
library(openxlsx)
library(corrplot)
library(mice)
library(earth)

Data Exploration

Loading and Evaluating the Data

Historical data was provided in Excel workbooks. The training and evaluation files are read into data frames and evaluated.

train.df <- read.xlsx('StudentData.xlsx', sheet = 1)
eval.df <- read.xlsx('StudentEvaluation.xlsx', sheet = 1)

glimpse(train.df)
## Rows: 2,571
## Columns: 33
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…

Categorical Data

Brand.Code is the only categorical predictor in the data set and needs to be converted to a factor.


train.df <- train.df |>
  mutate(Brand.Code = as.factor(Brand.Code))

eval.df <- eval.df |>
  mutate(Brand.Code = as.factor(Brand.Code))

train.df |>
  ggplot() + 
  geom_bar(aes(x = Brand.Code)) + 
  labs(title = 'Distribution of Brand.Code')

Numerical Distributions

PH is the response variable and the remaining variables are predictors for that response.

train.df |>
  keep(is.numeric) |>
  summary()
##   Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##  Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
##  1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917   1st Qu.:65.60  
##  Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
##  Mean   :5.370   Mean   :23.97   Mean   :0.27712   Mean   :68.19  
##  3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200   3rd Qu.:70.60  
##  Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
##  NA's   :10      NA's   :38      NA's   :39        NA's   :27     
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.1   Mean   :0.08457   Mean   :0.1954   Mean   :0.05641  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
##  Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
##  NA's   :26      NA's   :33        NA's   :23       NA's   :39       
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure1  
##  Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   :-0.80  
##  1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00   1st Qu.: 0.00  
##  Median :  65.20   Median :123.2   Median :46.40   Median :11.40  
##  Mean   :  24.57   Mean   :122.6   Mean   :47.92   Mean   :12.44  
##  3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:20.20  
##  Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :58.00  
##  NA's   :2         NA's   :32      NA's   :22      NA's   :11     
##  Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4     Filler.Level  
##  Min.   : 0.00   Min.   :-1.20   Min.   : 52.00   Min.   : 55.8  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 98.3  
##  Median :28.60   Median :27.60   Median : 96.00   Median :118.4  
##  Mean   :20.96   Mean   :20.46   Mean   : 96.29   Mean   :109.3  
##  3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00   3rd Qu.:120.0  
##  Max.   :59.40   Max.   :50.00   Max.   :142.00   Max.   :161.2  
##  NA's   :15      NA's   :15      NA's   :30       NA's   :20     
##   Filler.Speed   Temperature      Usage.cont      Carb.Flow       Density     
##  Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.240  
##  1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144   1st Qu.:0.900  
##  Median :3982   Median :65.60   Median :21.79   Median :3028   Median :0.980  
##  Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468   Mean   :1.174  
##  3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186   3rd Qu.:1.620  
##  Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920  
##  NA's   :57     NA's   :14      NA's   :5       NA's   :2      NA's   :1      
##       MFR           Balling       Pressure.Vacuum        PH       
##  Min.   : 31.4   Min.   :-0.170   Min.   :-6.600   Min.   :7.880  
##  1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600   1st Qu.:8.440  
##  Median :724.0   Median : 1.648   Median :-5.400   Median :8.540  
##  Mean   :704.0   Mean   : 2.198   Mean   :-5.216   Mean   :8.546  
##  3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000   3rd Qu.:8.680  
##  Max.   :868.6   Max.   : 4.012   Max.   :-3.600   Max.   :9.360  
##  NA's   :212     NA's   :1                         NA's   :4      
##  Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint Air.Pressurer  
##  Min.   :0.00240   Min.   : 70.0   Min.   :44.00     Min.   :140.8  
##  1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2  
##  Median :0.03340   Median :120.0   Median :46.00     Median :142.6  
##  Mean   :0.04684   Mean   :109.3   Mean   :47.62     Mean   :142.8  
##  3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0  
##  Max.   :0.40000   Max.   :140.0   Max.   :52.00     Max.   :148.2  
##  NA's   :12        NA's   :2       NA's   :12                       
##     Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :6.560   Median :5.400   Median :1.48  
##  Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :8.620   Max.   :6.060   Max.   :3.66  
##  NA's   :9       NA's   :10      NA's   :1

Predictor Correlations

We constructed a correlation plot to determine which predictors are related to PH, and whether any predictors are highly correlated with one another.

train.df |>
  keep(is.numeric) |>
  na.omit() |>
  cor() |>
  corrplot(tl.col = 'black', tl.cex = .6)

Looking at the chart, several variables have very low correlation with PH; for example, Carb.Temp, PSC, PSC.Fill, and PSC.CO2 appear to have little relationship to PH.
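
The same reading can be checked numerically by ranking the predictors by their absolute correlation with PH. This is a quick sketch over the complete cases only, not part of the original analysis.

cors <- train.df |>
  keep(is.numeric) |>
  na.omit() |>
  cor()

# Rank predictors by absolute correlation with PH (PH itself excluded)
sort(abs(cors[setdiff(rownames(cors), 'PH'), 'PH']), decreasing = TRUE)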

Near-Zero Variance

We look for variables with near-zero variance, which indicates that they carry little information for predicting the PH of the beverage.

train.df |>
  nearZeroVar(saveMetrics = TRUE) |>
  filter(nzv == TRUE)
##               freqRatio percentUnique zeroVar  nzv
## Hyd.Pressure1  31.11111      9.529366   FALSE TRUE

The predictor Hyd.Pressure1 has near-zero variance (its values are dominated by a long run of zeros) and will be removed from the model.

Missing Values

Many of the predictors have missing values that need to be evaluated and imputed to develop a consistent model.

plot_missing(train.df)
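
The counts behind the plot can also be pulled directly; a quick sketch of the ten most affected columns:

# Missing-value counts per column, highest first
colSums(is.na(train.df)) |>
  sort(decreasing = TRUE) |>
  head(10)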

We use the mice package with Predictive Mean Matching (PMM) to impute the missing values, then remove Hyd.Pressure1, which has near-zero variance.


impute.df <- train.df |>
  mice(m = 1, method = 'pmm', print = FALSE) |>
  complete()

impute.df <- impute.df[,-nearZeroVar(impute.df)]

summary(impute.df)
##  Brand.Code  Carb.Volume     Fill.Ounces      PC.Volume       Carb.Pressure  
##  A: 313     Min.   :5.040   Min.   :23.63   Min.   :0.07933   Min.   :57.00  
##  B:1315     1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23933   1st Qu.:65.60  
##  C: 324     Median :5.347   Median :23.97   Median :0.27133   Median :68.20  
##  D: 619     Mean   :5.370   Mean   :23.97   Mean   :0.27789   Mean   :68.21  
##             3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31267   3rd Qu.:70.60  
##             Max.   :5.700   Max.   :24.32   Max.   :0.47800   Max.   :79.40  
##    Carb.Temp          PSC             PSC.Fill         PSC.CO2       
##  Min.   :128.6   Min.   :0.00200   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000   1st Qu.:0.02000  
##  Median :140.8   Median :0.07600   Median :0.1800   Median :0.04000  
##  Mean   :141.1   Mean   :0.08461   Mean   :0.1962   Mean   :0.05662  
##  3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600   3rd Qu.:0.08000  
##  Max.   :154.0   Max.   :0.27000   Max.   :0.6200   Max.   :0.24000  
##     Mnf.Flow       Carb.Pressure1  Fill.Pressure   Hyd.Pressure2  
##  Min.   :-100.20   Min.   :105.6   Min.   :34.60   Min.   : 0.00  
##  1st Qu.:-100.00   1st Qu.:118.8   1st Qu.:46.00   1st Qu.: 0.00  
##  Median :  64.80   Median :123.2   Median :46.40   Median :28.60  
##  Mean   :  24.47   Mean   :122.6   Mean   :47.92   Mean   :20.97  
##  3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00   3rd Qu.:34.60  
##  Max.   : 229.40   Max.   :140.2   Max.   :60.40   Max.   :59.40  
##  Hyd.Pressure3   Hyd.Pressure4     Filler.Level    Filler.Speed 
##  Min.   :-1.20   Min.   : 52.00   Min.   : 55.8   Min.   : 998  
##  1st Qu.: 0.00   1st Qu.: 86.00   1st Qu.: 97.7   1st Qu.:3819  
##  Median :27.40   Median : 96.00   Median :118.4   Median :3980  
##  Mean   :20.44   Mean   : 96.55   Mean   :109.2   Mean   :3638  
##  3rd Qu.:33.30   3rd Qu.:102.00   3rd Qu.:120.0   3rd Qu.:3996  
##  Max.   :50.00   Max.   :142.00   Max.   :161.2   Max.   :4030  
##   Temperature      Usage.cont      Carb.Flow       Density           MFR       
##  Min.   :63.60   Min.   :12.08   Min.   :  26   Min.   :0.240   Min.   : 31.4  
##  1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1151   1st Qu.:0.900   1st Qu.:695.0  
##  Median :65.60   Median :21.80   Median :3028   Median :0.980   Median :721.4  
##  Mean   :65.98   Mean   :20.99   Mean   :2469   Mean   :1.174   Mean   :668.9  
##  3rd Qu.:66.40   3rd Qu.:23.76   3rd Qu.:3187   3rd Qu.:1.620   3rd Qu.:730.4  
##  Max.   :76.20   Max.   :25.90   Max.   :5104   Max.   :1.920   Max.   :868.6  
##     Balling       Pressure.Vacuum        PH        Oxygen.Filler    
##  Min.   :-0.170   Min.   :-6.600   Min.   :7.880   Min.   :0.00240  
##  1st Qu.: 1.496   1st Qu.:-5.600   1st Qu.:8.440   1st Qu.:0.02200  
##  Median : 1.648   Median :-5.400   Median :8.540   Median :0.03340  
##  Mean   : 2.197   Mean   :-5.216   Mean   :8.546   Mean   :0.04699  
##  3rd Qu.: 3.292   3rd Qu.:-5.000   3rd Qu.:8.680   3rd Qu.:0.06000  
##  Max.   : 4.012   Max.   :-3.600   Max.   :9.360   Max.   :0.40000  
##  Bowl.Setpoint   Pressure.Setpoint Air.Pressurer      Alch.Rel    
##  Min.   : 70.0   Min.   :44.00     Min.   :140.8   Min.   :5.280  
##  1st Qu.:100.0   1st Qu.:46.00     1st Qu.:142.2   1st Qu.:6.540  
##  Median :120.0   Median :46.00     Median :142.6   Median :6.560  
##  Mean   :109.3   Mean   :47.61     Mean   :142.8   Mean   :6.897  
##  3rd Qu.:120.0   3rd Qu.:50.00     3rd Qu.:143.0   3rd Qu.:7.230  
##  Max.   :140.0   Max.   :52.00     Max.   :148.2   Max.   :8.620  
##     Carb.Rel      Balling.Lvl  
##  Min.   :4.960   Min.   :0.00  
##  1st Qu.:5.340   1st Qu.:1.38  
##  Median :5.400   Median :1.48  
##  Mean   :5.436   Mean   :2.05  
##  3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :6.060   Max.   :3.66

Build Models

Split Data

We create an 80/20 split of the data so that we can evaluate the effectiveness of each model on held-out observations; a sketch of the partitioning follows.
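
The original partitioning chunk is not shown, so the following is a minimal sketch, assuming caret's createDataPartition and an 80% split (consistent with the 2,058-sample training sets in the model output below). It produces the train.x/train.y and test.x/test.y objects the model chunks expect.

set.seed(200)

# 80/20 split of the imputed data, stratified on the response PH
train.idx <- createDataPartition(impute.df$PH, p = 0.8, list = FALSE)

train.x <- impute.df[train.idx, ] |> select(-PH)   # training predictors
train.y <- impute.df$PH[train.idx]                 # training response
test.x  <- impute.df[-train.idx, ] |> select(-PH)  # hold-out predictors
test.y  <- impute.df$PH[-train.idx]                # hold-out response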

LM

Construct a Linear Regression Model with the training set and evaluate against the test set.

set.seed(200)

lm.model <- train(train.x, train.y,
                  method = "lm",
                  trControl = trainControl(method = "cv",number = 10))
lm.model
## Linear Regression 
## 
## 2058 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1330872  0.4011177  0.1040745
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm.pred <- predict(lm.model, newdata = test.x)
postResample(pred = lm.pred, obs = test.y)
##      RMSE  Rsquared       MAE 
## 0.1396969 0.3764328 0.1043370

PLS

Construct a Partial Least Squares Model with the training set and evaluate against the test set.

set.seed(200)

pls.model <- train(train.x, train.y,
                   method='pls',
                   tuneLength=20,
                   trControl=trainControl(method='cv', number=10),
                   preProc=c('center','scale'))

pls.model
## Partial Least Squares 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1525958  0.2112304  0.1218642
##    2     0.1450553  0.2864518  0.1140696
##    3     0.1414494  0.3225656  0.1113618
##    4     0.1394063  0.3421337  0.1097266
##    5     0.1367553  0.3668592  0.1071295
##    6     0.1351591  0.3816178  0.1065537
##    7     0.1344708  0.3882548  0.1056289
##    8     0.1341161  0.3914107  0.1054017
##    9     0.1338555  0.3938736  0.1048611
##   10     0.1336656  0.3956137  0.1046495
##   11     0.1335530  0.3964762  0.1046391
##   12     0.1334661  0.3971635  0.1044665
##   13     0.1333100  0.3984365  0.1044862
##   14     0.1332473  0.3991241  0.1044555
##   15     0.1332402  0.3992458  0.1043565
##   16     0.1332062  0.3995973  0.1044712
##   17     0.1331298  0.4003092  0.1043198
##   18     0.1331790  0.3999197  0.1042926
##   19     0.1330622  0.4009587  0.1042041
##   20     0.1330692  0.4009621  0.1042114
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 19.
plot(pls.model)

pls.pred <- predict(pls.model, newdata = test.x)
postResample(pred = pls.pred, obs = test.y)
##      RMSE  Rsquared       MAE 
## 0.1398721 0.3749599 0.1047261

MARS

Construct a Multivariate Adaptive Regression Spline Model with the training set and evaluate against the test set.

set.seed(200)

mars.model <- train(x = train.x, 
                    y = train.y,
                    method='earth',
                    tuneGrid = expand.grid(.degree = 1:2, .nprune = 2:15),
                    preProcess = c('center','scale'),
                    tuneLength = 10,
                    trControl = trainControl(method='cv'))

mars.model
## Multivariate Adaptive Regression Spline 
## 
## 2058 samples
##   31 predictor
## 
## Pre-processing: centered (30), scaled (30), ignore (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE       
##   1        2      0.1519593  0.2173347  0.11981987
##   1        3      0.1436064  0.3027557  0.11341513
##   1        4      0.1424996  0.3134170  0.11209297
##   1        5      0.1412752  0.3250453  0.11126242
##   1        6      0.1390703  0.3454738  0.10888656
##   1        7      0.1373233  0.3618968  0.10741878
##   1        8      0.1358347  0.3754513  0.10614550
##   1        9      0.1348121  0.3855048  0.10475480
##   1       10      0.1332075  0.3998987  0.10384130
##   1       11      0.1322520  0.4081996  0.10266059
##   1       12      0.1318186  0.4122003  0.10224431
##   1       13      0.1300985  0.4274200  0.10114811
##   1       14      0.1298456  0.4295064  0.10086508
##   1       15      0.1292671  0.4343528  0.10048585
##   2        2      0.1519593  0.2173347  0.11981987
##   2        3      0.1444026  0.2955454  0.11423636
##   2        4      0.1423315  0.3147275  0.11232971
##   2        5      0.1400932  0.3363659  0.11049135
##   2        6      0.1388901  0.3474765  0.10883481
##   2        7      0.1377243  0.3596276  0.10770676
##   2        8      0.1361951  0.3737595  0.10571512
##   2        9      0.1351186  0.3846038  0.10471649
##   2       10      0.1330222  0.4038920  0.10284740
##   2       11      0.1310934  0.4204518  0.10163345
##   2       12      0.1294455  0.4342111  0.09953203
##   2       13      0.1283521  0.4441439  0.09859935
##   2       14      0.1270568  0.4550686  0.09770187
##   2       15      0.1263514  0.4610164  0.09681491
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
mars.pred <- predict(mars.model, newdata = test.x)
postResample(pred = mars.pred, obs = test.y)
##      RMSE  Rsquared       MAE 
## 0.1366891 0.4027412 0.1001467

BT

Construct a Boosted Tree Model with the training set and evaluate against the test set.

set.seed(200)

gbm.model <- train(train.x, train.y,
                   method = "gbm",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = expand.grid(interaction.depth = seq(1, 7, by = 2),
                                          n.trees = seq(100, 1000, by = 50),
                                          shrinkage = c(0.01, 0.1),
                                          n.minobsinnode = 10),
                   verbose = FALSE)
gbm.model
## Stochastic Gradient Boosting 
## 
## 2058 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   0.01       1                   100     0.1523530  0.3119592  0.12203182
##   0.01       1                   150     0.1474505  0.3415488  0.11788436
##   0.01       1                   200     0.1438972  0.3605049  0.11490128
##   0.01       1                   250     0.1411843  0.3738853  0.11272101
##   0.01       1                   300     0.1390103  0.3840941  0.11094965
##   0.01       1                   350     0.1374860  0.3897144  0.10971008
##   0.01       1                   400     0.1363454  0.3942788  0.10877506
##   0.01       1                   450     0.1354198  0.3990871  0.10801896
##   0.01       1                   500     0.1347201  0.4025222  0.10745010
##   0.01       1                   550     0.1340625  0.4070181  0.10691967
##   0.01       1                   600     0.1334963  0.4107255  0.10646798
##   0.01       1                   650     0.1329929  0.4143471  0.10604720
##   0.01       1                   700     0.1325657  0.4174464  0.10565816
##   0.01       1                   750     0.1321634  0.4202061  0.10530869
##   0.01       1                   800     0.1317646  0.4230349  0.10498552
##   0.01       1                   850     0.1314161  0.4254842  0.10467934
##   0.01       1                   900     0.1311301  0.4274881  0.10442771
##   0.01       1                   950     0.1308327  0.4298861  0.10417909
##   0.01       1                  1000     0.1305332  0.4320379  0.10389071
##   0.01       3                   100     0.1415708  0.4190988  0.11348201
##   0.01       3                   150     0.1352931  0.4413851  0.10819917
##   0.01       3                   200     0.1312982  0.4567712  0.10479737
##   0.01       3                   250     0.1287164  0.4690652  0.10247155
##   0.01       3                   300     0.1268598  0.4790437  0.10081947
##   0.01       3                   350     0.1253208  0.4883158  0.09945864
##   0.01       3                   400     0.1241145  0.4956628  0.09833239
##   0.01       3                   450     0.1232290  0.5011062  0.09746881
##   0.01       3                   500     0.1223970  0.5066522  0.09661292
##   0.01       3                   550     0.1216553  0.5113117  0.09586007
##   0.01       3                   600     0.1209650  0.5159266  0.09513149
##   0.01       3                   650     0.1204203  0.5193619  0.09455273
##   0.01       3                   700     0.1198866  0.5229287  0.09396714
##   0.01       3                   750     0.1194500  0.5258525  0.09347293
##   0.01       3                   800     0.1190126  0.5285590  0.09304575
##   0.01       3                   850     0.1185584  0.5317001  0.09256604
##   0.01       3                   900     0.1181479  0.5345187  0.09214940
##   0.01       3                   950     0.1178040  0.5366624  0.09177567
##   0.01       3                  1000     0.1174374  0.5391947  0.09139282
##   0.01       5                   100     0.1370860  0.4676510  0.10932691
##   0.01       5                   150     0.1302670  0.4879111  0.10339200
##   0.01       5                   200     0.1259103  0.5044192  0.09964645
##   0.01       5                   250     0.1230882  0.5170579  0.09712308
##   0.01       5                   300     0.1211207  0.5267196  0.09530564
##   0.01       5                   350     0.1195372  0.5358168  0.09382259
##   0.01       5                   400     0.1182392  0.5434687  0.09258626
##   0.01       5                   450     0.1172179  0.5494681  0.09151781
##   0.01       5                   500     0.1164673  0.5533239  0.09076463
##   0.01       5                   550     0.1157392  0.5576540  0.09005657
##   0.01       5                   600     0.1151108  0.5614288  0.08948031
##   0.01       5                   650     0.1145076  0.5652082  0.08885590
##   0.01       5                   700     0.1140258  0.5680074  0.08835541
##   0.01       5                   750     0.1135324  0.5711834  0.08787691
##   0.01       5                   800     0.1130509  0.5743438  0.08738190
##   0.01       5                   850     0.1125362  0.5776265  0.08688094
##   0.01       5                   900     0.1121588  0.5800251  0.08656658
##   0.01       5                   950     0.1117764  0.5823843  0.08615613
##   0.01       5                  1000     0.1115184  0.5838670  0.08582951
##   0.01       7                   100     0.1342139  0.4982731  0.10684194
##   0.01       7                   150     0.1270261  0.5167426  0.10057407
##   0.01       7                   200     0.1224944  0.5327363  0.09654200
##   0.01       7                   250     0.1194406  0.5461478  0.09381487
##   0.01       7                   300     0.1174010  0.5556221  0.09186063
##   0.01       7                   350     0.1157821  0.5640629  0.09032519
##   0.01       7                   400     0.1145704  0.5706573  0.08914123
##   0.01       7                   450     0.1135441  0.5765679  0.08812839
##   0.01       7                   500     0.1126344  0.5819316  0.08727011
##   0.01       7                   550     0.1119832  0.5855958  0.08658761
##   0.01       7                   600     0.1113210  0.5891498  0.08595459
##   0.01       7                   650     0.1107006  0.5928040  0.08534412
##   0.01       7                   700     0.1102049  0.5957555  0.08482307
##   0.01       7                   750     0.1097457  0.5984771  0.08436663
##   0.01       7                   800     0.1092611  0.6013617  0.08391944
##   0.01       7                   850     0.1088465  0.6040215  0.08344786
##   0.01       7                   900     0.1085133  0.6060229  0.08307932
##   0.01       7                   950     0.1081617  0.6082807  0.08275758
##   0.01       7                  1000     0.1078023  0.6105489  0.08238882
##   0.10       1                   100     0.1305678  0.4312207  0.10395219
##   0.10       1                   150     0.1282054  0.4477566  0.10143194
##   0.10       1                   200     0.1274292  0.4523396  0.10037926
##   0.10       1                   250     0.1267606  0.4565717  0.09936376
##   0.10       1                   300     0.1264646  0.4584828  0.09888923
##   0.10       1                   350     0.1261117  0.4606947  0.09843724
##   0.10       1                   400     0.1259148  0.4620522  0.09786859
##   0.10       1                   450     0.1256598  0.4642112  0.09759580
##   0.10       1                   500     0.1256141  0.4645380  0.09753160
##   0.10       1                   550     0.1254855  0.4654027  0.09740397
##   0.10       1                   600     0.1253991  0.4661349  0.09733180
##   0.10       1                   650     0.1254600  0.4657393  0.09728092
##   0.10       1                   700     0.1255502  0.4650022  0.09740311
##   0.10       1                   750     0.1255903  0.4646530  0.09730425
##   0.10       1                   800     0.1256046  0.4647453  0.09729416
##   0.10       1                   850     0.1256222  0.4649123  0.09723545
##   0.10       1                   900     0.1255960  0.4651797  0.09715294
##   0.10       1                   950     0.1254433  0.4663054  0.09695375
##   0.10       1                  1000     0.1253694  0.4669895  0.09700719
##   0.10       3                   100     0.1173833  0.5370013  0.09132690
##   0.10       3                   150     0.1157319  0.5474978  0.08937931
##   0.10       3                   200     0.1143793  0.5568460  0.08791406
##   0.10       3                   250     0.1135167  0.5630276  0.08693628
##   0.10       3                   300     0.1127019  0.5690655  0.08618343
##   0.10       3                   350     0.1121714  0.5730238  0.08586251
##   0.10       3                   400     0.1120209  0.5743261  0.08561256
##   0.10       3                   450     0.1116947  0.5763781  0.08509327
##   0.10       3                   500     0.1114750  0.5782678  0.08495813
##   0.10       3                   550     0.1110443  0.5813300  0.08460982
##   0.10       3                   600     0.1108103  0.5832841  0.08445083
##   0.10       3                   650     0.1110192  0.5816803  0.08461997
##   0.10       3                   700     0.1106782  0.5844288  0.08421633
##   0.10       3                   750     0.1106600  0.5847471  0.08392635
##   0.10       3                   800     0.1102981  0.5874271  0.08358527
##   0.10       3                   850     0.1102917  0.5876629  0.08353160
##   0.10       3                   900     0.1100556  0.5897448  0.08343610
##   0.10       3                   950     0.1098820  0.5910939  0.08341153
##   0.10       3                  1000     0.1097981  0.5918695  0.08328473
##   0.10       5                   100     0.1134493  0.5661261  0.08746607
##   0.10       5                   150     0.1113925  0.5802328  0.08472383
##   0.10       5                   200     0.1106609  0.5844985  0.08422365
##   0.10       5                   250     0.1095775  0.5924311  0.08344208
##   0.10       5                   300     0.1087468  0.5983458  0.08278681
##   0.10       5                   350     0.1084243  0.6008193  0.08239929
##   0.10       5                   400     0.1081081  0.6032341  0.08213135
##   0.10       5                   450     0.1081318  0.6029802  0.08203610
##   0.10       5                   500     0.1081481  0.6030625  0.08196245
##   0.10       5                   550     0.1079318  0.6044821  0.08174909
##   0.10       5                   600     0.1076626  0.6066524  0.08142384
##   0.10       5                   650     0.1074597  0.6081775  0.08129023
##   0.10       5                   700     0.1072053  0.6101711  0.08114198
##   0.10       5                   750     0.1072177  0.6102686  0.08110459
##   0.10       5                   800     0.1072049  0.6105270  0.08100676
##   0.10       5                   850     0.1071836  0.6107098  0.08100749
##   0.10       5                   900     0.1072853  0.6101251  0.08110176
##   0.10       5                   950     0.1072663  0.6103520  0.08114777
##   0.10       5                  1000     0.1073540  0.6095887  0.08129552
##   0.10       7                   100     0.1100874  0.5904037  0.08385399
##   0.10       7                   150     0.1078922  0.6060368  0.08170157
##   0.10       7                   200     0.1068334  0.6129153  0.08063340
##   0.10       7                   250     0.1065951  0.6142656  0.08016653
##   0.10       7                   300     0.1065075  0.6148422  0.07981809
##   0.10       7                   350     0.1062637  0.6167367  0.07974546
##   0.10       7                   400     0.1065941  0.6143933  0.07994070
##   0.10       7                   450     0.1063581  0.6161393  0.07975480
##   0.10       7                   500     0.1063456  0.6161974  0.07968278
##   0.10       7                   550     0.1060731  0.6183322  0.07940382
##   0.10       7                   600     0.1062205  0.6172598  0.07944048
##   0.10       7                   650     0.1062470  0.6171154  0.07947978
##   0.10       7                   700     0.1062981  0.6167126  0.07952513
##   0.10       7                   750     0.1064330  0.6158421  0.07963543
##   0.10       7                   800     0.1063724  0.6163675  0.07953334
##   0.10       7                   850     0.1062828  0.6171753  0.07945210
##   0.10       7                   900     0.1061756  0.6179672  0.07941516
##   0.10       7                   950     0.1061899  0.6180009  0.07940876
##   0.10       7                  1000     0.1061546  0.6183485  0.07939927
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 550, interaction.depth =
##  7, shrinkage = 0.1 and n.minobsinnode = 10.
gbm.pred <- predict(gbm.model, newdata = test.x)
postResample(pred = gbm.pred, obs = test.y)
##       RMSE   Rsquared        MAE 
## 0.11542839 0.57373535 0.08164853

RF

Construct a Random Forest Model with the training set and evaluate against the test set.

set.seed(200)

rf.model <- randomForest(train.x, train.y,
                         importance=TRUE,
                         ntree=2000)
rf.model
## 
## Call:
##  randomForest(x = train.x, y = train.y, ntree = 2000, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 0.009253301
##                     % Var explained: 68.52
rf.pred <- predict(rf.model, newdata = test.x)
postResample(pred = rf.pred, obs = test.y)
##       RMSE   Rsquared        MAE 
## 0.10971015 0.63248324 0.07551098

CART

Construct a Classification and Regression Tree Model with the training set and evaluate against the test set.

set.seed(200)

cart.model <- train(train.x, train.y,
                    method="rpart",
                    tuneLength = 10,
                    trControl=trainControl(method="cv"))
cart.model
## CART 
## 
## 2058 samples
##   31 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.01185559  0.1296635  0.4308978  0.1010382
##   0.01339721  0.1301769  0.4256272  0.1014535
##   0.01441989  0.1322061  0.4074353  0.1036957
##   0.01445036  0.1322061  0.4074353  0.1036957
##   0.01823303  0.1351298  0.3810924  0.1059655
##   0.02174111  0.1372140  0.3621558  0.1084246
##   0.03349671  0.1401799  0.3338156  0.1109589
##   0.04196751  0.1435611  0.3018764  0.1144110
##   0.07211729  0.1495527  0.2418983  0.1176743
##   0.21348839  0.1672073  0.1635403  0.1340677
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01185559.
cart.pred <- predict(cart.model, newdata = test.x)
postResample(pred = cart.pred, obs = test.y)
##      RMSE  Rsquared       MAE 
## 0.1376576 0.3937274 0.1040229

SVM

Construct an SVM Model (radial basis kernel) with the training set and evaluate against the test set. This model requires that the categorical predictor, Brand.Code, be removed, since the kernel operates on numeric inputs.

set.seed(200)

svm.model <- train(train.x[,-1], train.y,
                   method = 'svmRadial',
                   preProcess = c('center','scale'),
                   tuneLength = 10,
                   trControl = trainControl(method = 'cv'))
svm.model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2058 samples
##   30 predictor
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1317637  0.4222040  0.09825709
##     0.50  0.1288566  0.4449897  0.09477889
##     1.00  0.1266608  0.4624419  0.09214453
##     2.00  0.1249498  0.4757384  0.09054553
##     4.00  0.1236949  0.4864580  0.08973337
##     8.00  0.1233301  0.4929074  0.08985149
##    16.00  0.1253959  0.4844122  0.09207591
##    32.00  0.1286991  0.4712901  0.09485684
##    64.00  0.1322165  0.4589403  0.09822307
##   128.00  0.1373938  0.4406352  0.10247597
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02503082
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02503082 and C = 8.
svm.pred <- predict(svm.model, newdata = test.x[,-1])
postResample(pred = svm.pred, obs = test.y)
##       RMSE   Rsquared        MAE 
## 0.12551989 0.49840999 0.08851726
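
Dropping Brand.Code discards the brand information for this model. An alternative, sketched below with caret's dummyVars (not used in the models above), would be to dummy-encode the factor so the SVM can see it:

# Dummy-encode Brand.Code instead of dropping it (hypothetical alternative)
dummies <- dummyVars(~ ., data = train.x)
train.x.num <- predict(dummies, newdata = train.x) |> as.data.frame()
test.x.num  <- predict(dummies, newdata = test.x)  |> as.data.frame()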

Model Results

results <- bind_rows(
  data.frame(as.list(postResample(pred = lm.pred,   obs = test.y))) |> mutate(model = 'LM'),
  data.frame(as.list(postResample(pred = pls.pred,  obs = test.y))) |> mutate(model = 'PLS'),
  data.frame(as.list(postResample(pred = mars.pred, obs = test.y))) |> mutate(model = 'MARS'),
  data.frame(as.list(postResample(pred = gbm.pred,  obs = test.y))) |> mutate(model = 'TREE'),
  data.frame(as.list(postResample(pred = rf.pred,   obs = test.y))) |> mutate(model = 'RF'),
  data.frame(as.list(postResample(pred = cart.pred, obs = test.y))) |> mutate(model = 'CART'),
  data.frame(as.list(postResample(pred = svm.pred,  obs = test.y))) |> mutate(model = 'SVM')
) |>
  relocate(model, RMSE, Rsquared, MAE)

results |>
  arrange(desc(Rsquared))
##   model      RMSE  Rsquared        MAE
## 1    RF 0.1097102 0.6324832 0.07551098
## 2  TREE 0.1154284 0.5737354 0.08164853
## 3   SVM 0.1255199 0.4984100 0.08851726
## 4  MARS 0.1366891 0.4027412 0.10014672
## 5  CART 0.1376576 0.3937274 0.10402294
## 6    LM 0.1396969 0.3764328 0.10433701
## 7   PLS 0.1398721 0.3749599 0.10472612

The Random Forest model has the highest \(R^2\) value (and the lowest RMSE) on the test set and is chosen to model the PH values.
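
As a sanity check on the chosen model, we can plot predicted versus observed PH on the hold-out set; points near the dashed identity line indicate good fit. A sketch:

# Predicted vs. observed PH for the Random Forest on the hold-out set
data.frame(observed = test.y, predicted = rf.pred) |>
  ggplot(aes(x = observed, y = predicted)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(title = 'Random Forest: Predicted vs. Observed PH')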

Model Evaluation

Predictor Importance

Looking at the top 10 predictors, we see that Brand.Code and Mnf.Flow are the two most important predictors for determining the PH.

top10 <- varImp(rf.model) |>
  arrange(desc(Overall)) |>
  head(10)

varImpPlot(rf.model, n.var = 10)
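
The importance plot ranks predictors but does not show the direction of their effects. A partial dependence plot, sketched here with randomForest's partialPlot, shows how the forest's predicted PH moves with Mnf.Flow:

# Partial dependence of predicted PH on the top numeric predictor
partialPlot(rf.model, pred.data = train.x, x.var = 'Mnf.Flow',
            main = 'Partial Dependence of PH on Mnf.Flow')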

Correlation for Important Predictors

Now that we know which predictors are most important for determining the PH of the beverage, we can look at the correlation matrix to see how they relate.

impute.df |>
  select(c('PH', row.names(top10))) |>
  keep(is.numeric) |>
  cor() |>
  corrplot()

Forecast PH

Impute Missing Values

We will impute missing values in the evaluation data set using the same method as for the training set. Again we remove Hyd.Pressure1 because it has near-zero variance.

set.seed(200)

eval.impute <- eval.df |>
  select(-PH) |>                                 # PH is empty in the evaluation set
  mice(m = 1, method = 'pmm', print = FALSE) |>  # same PMM imputation as training
  complete() |>
  select(-Hyd.Pressure1)                         # near-zero variance, as in training

Predict PH

Using the Random Forest Model, we will predict the PH in the evaluation data set.

eval.pred <- predict(rf.model, newdata = eval.impute)
head(eval.pred, 10)
##        1        2        3        4        5        6        7        8 
## 8.571348 8.461156 8.548605 8.584590 8.517460 8.535009 8.490136 8.578362 
##        9       10 
## 8.546094 8.629645
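
The conclusion below claims the predictions fall within the historical PH range; overlaying the two distributions makes that visible. A sketch:

# Compare predicted PH with the historical training distribution
data.frame(PH = c(train.df$PH, eval.pred),
           source = rep(c('historical', 'predicted'),
                        c(nrow(train.df), length(eval.pred)))) |>
  ggplot(aes(x = PH, fill = source)) +
  geom_density(alpha = 0.4) +
  labs(title = 'Predicted vs. Historical PH')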

Create Output

Insert the predicted PH values into the original evaluation data set and export to Excel.

pred.df <- eval.df |>
  mutate(PH = eval.pred)

pred.df |>
  write.xlsx('StudentEvaluationPreds.xlsx')

Conclusion

After evaluating each model, the Random Forest model was determined to be the most effective at predicting PH. The predicted values fall within the same range as the historical data, which suggests the model fits the data well and that PH for the evaluation set can be forecast from the student data. For future work, it may be worth investigating neural networks to see whether they yield better results.