pacman::p_load(MASS, tidyverse, DataExplorer, skimr, Amelia, naniar, RANN, caret, car, janitor,
               missForest, caTools, earth, finalfit, inspectdf, xlsx)
df <- readxl::read_xlsx("Data/StudentData.xlsx")
eval <-  readxl::read_xlsx("Data/StudentEvaluation.xlsx")

1 Introduction

For this assignment our group was tasked with understanding the manufacturing process at ABC Beverages. Using historical data, we seek to build predictive models that illuminate the key process variables for predicting PH. This report is technical in nature; we assume that the reader is familiar with the statistical terminology and the models used here. For a simpler breakdown of our findings, a separate non-technical report will also be included, and that report goes into more detail on the most important predictors. In this report we explore the structure of the data, focus on building linear, non-linear, and tree-based models, find the best predictive model of each type, and then analyze which variables were most important to those models.

2 Data Exploration

We begin by exploring the data itself. Shown below is the historical data set we will be using. On initial inspection it is clear that there are missing values and that all of our columns are numeric except for the single categorical Brand Code column.

df
df$`Brand Code` <- as.factor(df$`Brand Code`)
eval$`Brand Code` <- as.factor(eval$`Brand Code`)
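
As an illustrative first pass (not part of our modeling pipeline), the already-loaded DataExplorer package can summarize the column types and missingness in a couple of calls:

# Illustrative check: column types and overall missingness (DataExplorer is loaded via p_load above)
introduce(df)       # row/column counts, discrete vs. continuous columns, total missing values
plot_missing(df)    # per-column percentage of missing values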

We further inspect the values and discover several key features. To begin with, the majority of our columns are missing at least one value. Fortunately, no column has an excessive amount of missingness; the worst case still has a completeness rate of 0.918. We also note that the most common Brand Code is B, which makes up roughly half the data, far more than any other brand in this data set. Many of the numeric variables also appear to be either heavily skewed or to have very little variance.

skim(df)
Data summary
Name df
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
factor 1
numeric 32
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Brand Code 120 0.95 FALSE 4 B: 1239, D: 615, C: 304, A: 293

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆
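
To back the low-variance observation with numbers, caret's near-zero-variance diagnostic can be run as an illustrative check (it is not part of our modeling pipeline):

# Illustrative check: flag columns with very low variance using caret's metrics
nzv <- nearZeroVar(df, saveMetrics = TRUE)
nzv[order(-nzv$freqRatio), c("freqRatio", "percentUnique", "nzv")]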

Looking at some of the variables with the most missing values, there does not appear to be a clear pattern to the missingness, so the data is most likely missing at random. We also run a statistical test (Little's MCAR test via mcar_test) to check whether the data is missing completely at random. Given the high test statistic and low p-value, we conclude the data is not missing completely at random.

As an additional note, we also chose to remove the 4 rows with missing PH values. Since our objective is to predict PH itself, it did not make sense to impute those values, and with so few missing we decided removing them would not change the results in any significant way.

df %>%
  missing_pattern('PH',c('MFR','Brand Code','Carb Pressure1','PSC CO2','Filler Speed','PC Volume'))

     PH Carb Pressure1 PC Volume PSC CO2 Filler Speed Brand Code MFR    
2160  1              1         1       1            1          1   1   0
138   1              1         1       1            1          1   0   1
100   1              1         1       1            1          0   1   1
10    1              1         1       1            1          0   0   2
3     1              1         1       1            0          1   1   1
47    1              1         1       1            0          1   0   2
3     1              1         1       1            0          0   0   3
31    1              1         1       0            1          1   1   1
5     1              1         1       0            1          1   0   2
2     1              1         1       0            1          0   1   2
26    1              1         0       1            1          1   1   1
4     1              1         0       1            1          1   0   2
4     1              1         0       1            1          0   1   2
1     1              1         0       1            0          1   0   3
1     1              1         0       0            1          1   1   2
28    1              0         1       1            1          1   1   1
1     1              0         1       1            1          0   1   2
3     1              0         0       1            1          1   1   2
1     0              1         1       1            1          1   0   2
3     0              1         1       1            0          1   0   3
      4             32        39      39           57        120 212 503
df <- df[!is.na(df$PH),]
mcar_test(df)
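
For reference, mcar_test implements Little's (1988) MCAR test. In its usual formulation (stated here from the standard reference, not from our code), the statistic sums a Mahalanobis-type distance across the \(J\) distinct missingness patterns and is compared to a chi-squared distribution:

\[
d^2 = \sum_{j=1}^{J} n_j \left(\bar{y}_{obs,j} - \hat{\mu}_{obs,j}\right)^{\top} \hat{\Sigma}_{obs,j}^{-1} \left(\bar{y}_{obs,j} - \hat{\mu}_{obs,j}\right), \qquad d^2 \;\dot\sim\; \chi^2_{\sum_j k_j - k},
\]

where \(n_j\) is the number of rows with pattern \(j\), \(k_j\) is the number of observed variables in that pattern, and \(k\) is the total number of variables.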

Looking at our values by brand, we find that some variables differ in distribution when broken out by brand. The missing values are not clearly concentrated in one brand or another. However, given how skewed some of the numeric variables are, it is still possible that some of these values come mostly from one brand.

plot_boxplot(df,by="Brand Code")

corr_simple <- function(data=df, sig=0.5){
  library(corrplot)
  # Convert data to numeric in order to run correlations.
  # Convert to factor first to keep the integrity of the data -
  # each value becomes a number rather than turning into NA.
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  # Run the correlations
  corr <- cor(df_cor)
  # Prepare to drop duplicates and self-correlations
  corr[lower.tri(corr, diag=TRUE)] <- NA
  # Drop perfect correlations
  corr[corr == 1] <- NA
  # Turn into a 3-column table
  corr <- as.data.frame(as.table(corr))
  # Remove the NA values from above
  corr <- na.omit(corr)
  # Keep only correlations above the significance threshold
  corr <- subset(corr, abs(Freq) > sig)
  # Sort by highest absolute correlation and print the table
  corr <- corr[order(-abs(corr$Freq)),]
  print(corr)
  # Turn corr back into a matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  # Plot the correlations visually
  corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ", method='number')
}

Next, we look at the most highly correlated pairs of variables in our data. Given how many cross-correlations exist within this data set, we decided it was prudent to examine only the strongest ones. Doing so shows that many of the predictors are highly correlated with one another.

corr_simple(na.omit(df))
              Var1              Var2       Freq
1080       Balling       Balling Lvl  0.9884121
908   Filler Level     Bowl Setpoint  0.9772417
1078       Density       Balling Lvl  0.9564173
781        Density           Balling  0.9531276
744   Filler Speed               MFR  0.9512644
1087      Alch Rel       Balling Lvl  0.9473433
1014       Balling          Alch Rel  0.9457260
1012       Density          Alch Rel  0.9229932
476  Hyd Pressure2     Hyd Pressure3  0.9176619
1054      Alch Rel          Carb Rel  0.8827915
1088      Carb Rel       Balling Lvl  0.8721483
1045       Density          Carb Rel  0.8603222
1047       Balling          Carb Rel  0.8589906
1025   Carb Volume          Carb Rel  0.8410722
992    Carb Volume          Alch Rel  0.8247876
170  Carb Pressure         Carb Temp  0.8243350
761    Carb Volume           Balling  0.8187788
1058   Carb Volume       Balling Lvl  0.8179933
936  Fill Pressure Pressure Setpoint  0.8152856
695    Carb Volume           Density  0.8018088
472       Mnf Flow     Hyd Pressure3  0.7641428
442  Hyd Pressure1     Hyd Pressure2  0.7118513
1006 Hyd Pressure4          Alch Rel -0.6855760
807  Hyd Pressure3   Pressure Vacuum -0.6722583
991     Brand Code          Alch Rel  0.6666969
439       Mnf Flow     Hyd Pressure2  0.6483379
806  Hyd Pressure2   Pressure Vacuum -0.6281675
496     Brand Code     Hyd Pressure4 -0.6248840
901       Mnf Flow     Bowl Setpoint -0.6115804
538       Mnf Flow      Filler Level -0.6083807
475  Hyd Pressure1     Hyd Pressure3  0.6055890
775  Hyd Pressure4           Balling -0.5949966
709  Hyd Pressure4           Density -0.5843820
1039 Hyd Pressure4          Carb Rel -0.5818275
1072 Hyd Pressure4       Balling Lvl -0.5750443
497    Carb Volume     Hyd Pressure4 -0.5679086
637       Mnf Flow        Usage cont  0.5601926
802       Mnf Flow   Pressure Vacuum -0.5599433
868       Mnf Flow     Oxygen Filler -0.5500664
1024    Brand Code          Carb Rel  0.5375376
34      Brand Code       Carb Volume  0.5337574
906  Hyd Pressure3     Bowl Setpoint -0.5292159
543  Hyd Pressure3      Filler Level -0.5279659
760     Brand Code           Balling  0.5007730

We calculate the variance-inflation and generalized variance-inflation factors. Doing so shows that there are possible collinearity issues among our predictors. This is not necessarily a problem for all models, but we make careful note of it regardless.

vif(lm(PH~.,df,na.action = na.exclude))
                          GVIF Df GVIF^(1/(2*Df))
`Brand Code`        192.548930  3        2.403017
`Carb Volume`        18.044215  1        4.247848
`Fill Ounces`         1.190455  1        1.091080
`PC Volume`           1.653163  1        1.285754
`Carb Pressure`      42.681219  1        6.533087
`Carb Temp`          35.506521  1        5.958735
PSC                   1.164174  1        1.078969
`PSC Fill`            1.112638  1        1.054817
`PSC CO2`             1.069956  1        1.034387
`Mnf Flow`            5.077635  1        2.253361
`Carb Pressure1`      1.485310  1        1.218733
`Fill Pressure`       3.804648  1        1.950551
`Hyd Pressure1`       3.065348  1        1.750813
`Hyd Pressure2`      10.892929  1        3.300444
`Hyd Pressure3`      12.954982  1        3.599303
`Hyd Pressure4`       3.122824  1        1.767151
`Filler Level`       25.175483  1        5.017518
`Filler Speed`       11.191642  1        3.345391
Temperature           1.343282  1        1.159000
`Usage cont`          1.768304  1        1.329776
`Carb Flow`           2.348573  1        1.532506
Density              17.059399  1        4.130303
MFR                  11.479327  1        3.388116
Balling             175.704876  1       13.255372
`Pressure Vacuum`     4.291585  1        2.071614
`Oxygen Filler`       1.567662  1        1.252063
`Bowl Setpoint`      25.827609  1        5.082087
`Pressure Setpoint`   3.597044  1        1.896587
`Air Pressurer`       1.242666  1        1.114749
`Alch Rel`           52.309517  1        7.232532
`Carb Rel`            7.657658  1        2.767247
`Balling Lvl`       210.175794  1       14.497441

3 Modeling

3.1 Initial Processing

The modeling process begins with cleaning the names of our predictor variables to make modeling easier. We then create a train and test split. We do this prior to imputation so that we have a hopefully unbiased set of test data that can be used to verify how well our models perform.

df_clean <- clean_names(df)

eval_clean <-   clean_names(eval)
set.seed(125)
sample <- sample.split(df_clean$ph, SplitRatio = 0.8)
train  <- subset(df_clean, sample == TRUE)
test   <- subset(df_clean, sample == FALSE)

We created two main versions of the data to build and test our models with. The first training set has missing numeric values filled in using KNN imputation, and missForest is then used to fill in the missing categorical values. The second set uses the same imputation process, but also applies the corr method to remove any columns that are overly correlated with one another.

set.seed(122)
impute <- preProcess(as.data.frame(train), method = "knnImpute")

train_imputeKNN <- predict(impute, train)

train_tree_KNN_impute <- missForest(as.data.frame(train_imputeKNN))
train_tree_KNN  <- train_tree_KNN_impute$ximp 
 

test_imputeKNN <- predict(impute, test)

test_tree_KNN_impute <- missForest(as.data.frame(test_imputeKNN))
test_tree_KNN  <- test_tree_KNN_impute$ximp 

 impute_eval <-  predict(impute, eval_clean)
tree_eval <-   missForest(as.data.frame(impute_eval))$ximp 
set.seed(122)

impute_c <- preProcess(as.data.frame(train), method = c("knnImpute","corr"))

train_imputeKNN_c <- predict(impute_c, as.data.frame(train))
 
 
train_tree_KNN_impute_c <- missForest(as.data.frame(train_imputeKNN_c))
train_tree_KNN_c  <- train_tree_KNN_impute_c$ximp
 
 
test_imputeKNN_c <- predict(impute_c, as.data.frame(test))
 
test_tree_KNN_impute_c <- missForest(as.data.frame(test_imputeKNN_c))
test_tree_KNN_c  <- test_tree_KNN_impute_c$ximp 
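
As an illustrative check of what the corr filter actually removed, comparing the column names of the two imputed training sets lists the dropped predictors:

# Illustrative check: columns removed by the "corr" pre-processing step
setdiff(names(train_tree_KNN), names(train_tree_KNN_c))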

A third imputed data set was created as well. It used KNN imputation for the numeric values, but the missing Brand Code values were filled in with a new level, N. We found that this data worked best for the tree-based models, so it was used for those.

# Keep same train/test split as above models, but preprocess slightly differently.
set.seed(122)
# Copy the d.f's so as not to mess with future predictions for previous models
train2 = train
test2 = test
eval2 = eval_clean
# Convert NA brand_code to a new brand "N" instead of KNN-imputing it.
levels(train2$brand_code) = c(levels(train2$brand_code), "N")
levels(test2$brand_code) = c(levels(test2$brand_code), "N")
levels(eval2$brand_code) = c(levels(eval2$brand_code), "N")

train2$brand_code[is.na(train2$brand_code)] = "N"
test2$brand_code[is.na(test2$brand_code)] = "N"
eval2$brand_code[is.na(eval2$brand_code)] = "N"




# Now use same imputation on remaining NA's 

impute2 <- preProcess(as.data.frame(train2), method = "knnImpute")
 train_imputeKNN2 <- predict(impute2, train2)

# Note: using training imputer to impute test NA's
test_imputeKNN2 <- predict(impute2, test2)
tree_eval_alt <-   predict(impute2,eval2)

3.2 Linear Modeling

3.2.1 Linear Model

lm_cols <- stepAIC(lm(ph ~ ., train_tree_KNN), trace=F)
set.seed(128)
lm_model <- train(train_tree_KNN[,colnames(lm_cols$model)[colnames(lm_cols$model) != "ph"]],
                  train_tree_KNN$ph,
                  method='lm',metric='Rsquared', trControl = ctrl)
resample_lm1 <- postResample(predict(lm_model$finalModel,  test_tree_KNN), test_tree_KNN$ph)

We start with a linear model and use backward stepwise regression on the tree_KNN pre-processed data. The model gives us the following training metrics: RMSE - 0.765, R-squared - 0.419, and MAE - 0.595. Below, we check the residual plots to make sure that there are no patterns:

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(lm_model$finalModel)

The test data for this model gives us the following metrics: RMSE - 0.774, R-squared - 0.378, and MAE - 0.602. We see that Linear Model 1 slightly overfits the PH data, since it does not predict the PH values in the test data as well as in the training data. Next, we attempt a second model using the train_tree_KNN_c data. Again we use stepwise regression, which gives us the following metrics:

lm2_col <- stepAIC(lm(ph~.,train_tree_KNN_c),trace=F)
set.seed(120)
lm_model2 <- train(train_tree_KNN_c[,colnames(lm2_col$model)[colnames(lm2_col$model) != "ph"]],
                  train_tree_KNN_c$ph,
                  method='lm',metric='Rsquared', trControl = ctrl)
resample_lm2 <- postResample(predict(lm_model2$finalModel,test_tree_KNN_c), test_tree_KNN_c$ph)

RMSE - 0.775, R-squared - 0.403, and MAE - 0.606. Below, we check the residual plots to make sure that there are no patterns:

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(lm_model2$finalModel)

The test data for this model gives us the following metrics: RMSE - 0.778, R-squared - 0.371, and MAE - 0.605. We see that Linear Model 2 also slightly overfits the PH data, since it does not predict the PH values in the test data as well as in the training data. Given that Linear Model 1 performs better on both the training and test data, there does not seem to be a need to pre-process the data by removing the highly correlated variables. The residual plot for Linear Model 2 looked slightly better, but not enough to warrant discarding Linear Model 1.

3.2.2 PLS Model

We attempt to get a higher R-squared by using different linear regression models on both data sets. The first model type we attempt is the Partial Least Squares (PLS) model. Below we see the results of training models with 1 to 15 components.

set.seed(120)
PLS_model <- train(ph~.,
                   data=train_tree_KNN,
                   method='pls',metric='Rsquared',
                   tuneLength = 15, trControl = ctrl)
PLS_model1 <- train(ph~.,
                   data=train_tree_KNN_c,
                   method='pls',metric='Rsquared',
                   tuneLength = 15, trControl = ctrl)
plot(PLS_model,main=latex2exp::TeX("PLS model 1 $R^2$"))

plot(PLS_model1,main=latex2exp::TeX("PLS model 2 $R^2$"))

resample_PLS1 <- postResample(predict(PLS_model,test_tree_KNN),test_tree_KNN$ph)
resample_PLS2 <- postResample(predict(PLS_model1,test_tree_KNN_c),test_tree_KNN_c$ph)

PLS Model 1 achieves its highest R-squared of 0.4138 using 14 components. PLS Model 2, on the other hand, achieves its highest R-squared of 0.3994, also with 14 components.

The re-sampled R-squared for PLS Model 1 was 0.3813, while PLS Model 2 had a re-sampled R-squared of 0.3811. Given that PLS Model 1 performed better on the test data than the linear models and PLS Model 2, we consider it the best linear model so far. Next we try ridge regression.

3.2.3 Ridge Regression

Given that there seem to be many highly correlated predictors, we decided to try a penalized regression model as a way to mitigate these correlations. Below we see the plots that show which lambda produces the highest R-squared.
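
As a reminder of what the lambda tuning parameter controls: ridge regression minimizes the usual least-squares loss plus an L2 penalty on the coefficients, which shrinks the coefficients of correlated predictors toward one another,

\[
\hat{\beta}^{\,ridge} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .
\]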

set.seed(120)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 10))
ridge_model <- train(ph~.,data=train_tree_KNN,
                     method = "ridge",metric='Rsquared',
                     tuneGrid = ridgeGrid,trControl = ctrl)
ridge_model1 <- train(ph~.,data=train_tree_KNN_c,
                     method = "ridge",metric='Rsquared',
                     tuneGrid = ridgeGrid,trControl = ctrl)
plot(ridge_model,main=latex2exp::TeX("Ridge model 1 $R^2$"))

plot(ridge_model1,main=latex2exp::TeX("Ridge model 2 $R^2$"))

resample_Ridge1 <- postResample(predict(ridge_model,test_tree_KNN),test_tree_KNN$ph)
resample_Ridge2 <- postResample(predict(ridge_model1,test_tree_KNN_c),test_tree_KNN_c$ph)

For both Ridge Model 1 and Ridge Model 2, a lambda of 0.011 gives the best R-squared: 0.4133 and 0.3975, respectively. The R-squared on the test data was 0.3838 for Ridge Regression 1, while for Ridge Regression 2 it was 0.3825.

Given that the re-sampled R-squared for the Ridge Regression 1 model is higher than for PLS Model 1, and therefore higher than any of the OLS regressions, we select Ridge Regression 1 as our best linear regression model.

3.3 Non-Linear Modeling

  tuning = trainControl(
                  method = "cv", number = 5)

3.3.1 KNN Modeling

For our non-linear modeling, we began with the k-nearest neighbors algorithm. We see the results of the model below; it appears to perform better than the linear models tested in the previous section.
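
For context, KNN regression predicts PH for a new observation as the average response of its k closest training points,

\[
\hat{y}(x) = \frac{1}{k}\sum_{i \in N_k(x)} y_i ,
\]

where \(N_k(x)\) is the set of the \(k\) training observations closest to \(x\) in the (scaled) predictor space.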

set.seed(124)
knnModel <- train( ph~., 
                    data = train_tree_KNN,
                    method = "knn",
            trControl = tuning,
              tuneLength = 10)
knnModel2 <- train( ph~., 
                    data = train_tree_KNN_c,
                    method = "knn",
            trControl = tuning,
              tuneLength = 10)
postResample(predict(knnModel,  test_tree_KNN), test_tree_KNN$ph) 
     RMSE  Rsquared       MAE 
0.6791006 0.5250011 0.5177956 
postResample(predict(knnModel2,  test_tree_KNN_c), test_tree_KNN_c$ph) 
     RMSE  Rsquared       MAE 
0.6811649 0.5241601 0.5180207 

3.3.2 Neural Network

For the next non-linear model, we tested a neural network. We found the parameters below to be optimal for training the network. This model performed noticeably better than KNN. It took much longer to train, however, so we saved the fitted model to disk and reload it when needed.
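
The avNNet method fits several single-hidden-layer networks and averages their predictions, which is part of why the fit is slow:

\[
\hat{y}(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x),
\]

where each \(f_m\) is a network started from different random weights (with .bag = FALSE the networks differ only in their random starting weights rather than in bootstrap resamples of the data).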

cutoff <- findCorrelation(cor(select(train_tree_KNN,!brand_code)), cutoff = .75)
train_bevnet <- train_tree_KNN[, -cutoff]
cutoff2 <- findCorrelation(cor(select(train_tree_KNN_c,!brand_code)), cutoff = .75)
train_bevnet2 <- train_tree_KNN_c[, -cutoff2]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
 .size = c(1:10),
 ## The next option is to use bagging (see the
 ## next chapter) instead of different random
 ## seeds.
 .bag = FALSE)
set.seed(100)
nnetTune <- train(ph~. ,
                  data = train_tree_KNN,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl =tuning,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_bevnet) + 1)+10 +1,
                  maxit = 1000,
                  allowParallel =T
                  )
set.seed(100)
nnetTune2 <- train(ph~. ,
                  data = train_tree_KNN_c,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl =tuning,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_bevnet2) + 1)+10 +1,
                  maxit = 1000,
                  allowParallel =T
                  )
saveRDS(nnetTune,"Models/nnetTune.rds") 
saveRDS(nnetTune2,"Models/nnetTune2.rds") 
nnetTune <- readRDS("Models/nnetTune.rds")
nnetTune2 <-  readRDS("Models/nnetTune2.rds")
postResample(predict(nnetTune,  test_tree_KNN), test_tree_KNN$ph) 
     RMSE  Rsquared       MAE 
0.6123699 0.6099018 0.4637191 
postResample(predict(nnetTune2,  test_tree_KNN_c), test_tree_KNN_c$ph) 
     RMSE  Rsquared       MAE 
0.6220944 0.5978055 0.4759819 

3.3.3 MARS

Next we tested a MARS model. It did not appear to handle this data very well, performing much worse than the previous two models. Removing the highly correlated predictors, however, helped this model greatly.
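
For reference, MARS builds its fit from hinge (piecewise-linear) basis functions and, at degree 2, from products of pairs of hinges,

\[
\hat{f}(x) = \beta_0 + \sum_{m=1}^{M}\beta_m h_m(x), \qquad h_m(x) \in \{\max(0,\, x_j - c),\; \max(0,\, c - x_j)\},
\]

with the nprune tuning parameter controlling how many terms \(M\) survive pruning.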

set.seed(122)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(ph~. ,
                  data = train_tree_KNN,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = tuning)
marsTuned2 <- train(ph~. ,
                  data = train_tree_KNN_c,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = tuning)
saveRDS(marsTuned,"Models/marsTuned.rds") 
saveRDS(marsTuned2,"Models/marsTuned2.rds") 
marsTuned <- readRDS("Models/marsTuned.rds")
marsTuned2 <-  readRDS("Models/marsTuned2.rds")
  postResample(predict(marsTuned,  test_tree_KNN), test_tree_KNN$ph) 
     RMSE  Rsquared       MAE 
0.7506974 0.4146177 0.5815867 
  postResample(predict(marsTuned2,  test_tree_KNN_c), test_tree_KNN_c$ph) 
     RMSE  Rsquared       MAE 
0.6871890 0.5145190 0.5371954 

3.3.4 Support Vector Machine

Finally, we tested a support vector machine. It also handles the data well, scoring second best among our non-linear models.
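
The svmRadial method uses the radial basis function kernel, whose sigma parameter controls how quickly similarity between two observations decays with distance:

\[
K(x, x') = \exp\!\left(-\sigma \lVert x - x'\rVert^2\right).
\]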

svmRTuned <- train(ph~. ,
                  data = train_tree_KNN,
                   method = "svmRadial",
                   tuneLength = 10,
                   trControl = tuning)
svmRTuned2 <- train(ph~. ,
                  data = train_tree_KNN_c,
                   method = "svmRadial",
                   tuneLength = 10,
                   trControl = tuning)
saveRDS(svmRTuned,"Models/svmRTuned.rds") 
saveRDS(svmRTuned2,"Models/svmRTuned2.rds") 
svmRTuned <- readRDS("Models/svmRTuned.rds")
svmRTuned2 <-  readRDS("Models/svmRTuned2.rds")
  postResample(predict(svmRTuned,  test_tree_KNN), test_tree_KNN$ph) 
     RMSE  Rsquared       MAE 
0.6453452 0.5699993 0.4746100 
  postResample(predict(svmRTuned2,  test_tree_KNN_c), test_tree_KNN_c$ph) 
     RMSE  Rsquared       MAE 
0.6444957 0.5682748 0.4720243 

3.4 Tree-based Models

3.4.1 Gradient-Boosted

Several hyperparameters were tuned for this model, and we spent some time discovering their optimal values.

gbmGrid = expand.grid(.interaction.depth = c(8, 12, 16, 20),
                      .n.trees = c(1500, 2000, 3000),
                      .shrinkage = c(.005, .01, .02, .03, .04),
                      .n.minobsinnode = c(4, 5, 6, 7, 9))
set.seed(624)

gbmCV = train(ph~. ,
                  data = train_imputeKNN2,
              method = 'gbm', 
              tuneGrid = gbmGrid, 
              distribution = 'gaussian', 
              trControl = tuning, 
              verbose = F)
 saveRDS("Models/gbmCVb.rds")
savedGbmC <- readRDS("Models/gbmCVb.rds")

This final GBM model used 2000 trees, which produced better results than 1500 or 3000 trees. The interaction depth was limited to 16 and 20, although 8 and 12 were tried previously with less success. The optimal minimum number of observations in a node was 6, and the shrinkage improved as it was increased in previous fits, up to .03 but not beyond.
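
The chosen combination is stored in the fitted caret object; the check below is illustrative and simply echoes the settings described above.

# Illustrative check: the cross-validated best hyperparameter combination
savedGbmC$bestTune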

postResample(predict(savedGbmC, as.data.frame(test_imputeKNN2)), test_imputeKNN2$ph)
     RMSE  Rsquared       MAE 
0.5423767 0.6939894 0.4030513 

The \(R^2\) is better on the validation set than on the averages of the CV-folds, suggesting that it was a somewhat “lucky” split.

3.4.2 Random Forest

rfGrid = expand.grid(.mtry = c(15, 20, 25))
set.seed(624)
# trainXb / trainYb are assumed to be the predictor columns and ph response from train_imputeKNN2
rfTune = train(trainXb, trainYb, method = 'rf', ntree = 2000,
               tuneGrid = rfGrid, trControl = tuning)
saveRDS(rfTune, "Models/rfTune.rds")
savedRfTune <- readRDS("Models/rfTune.rds")

It's interesting that the mtry hyperparameter kept improving the results as it got higher. Remember, there are only 32 predictors here, and the default suggestion for random forest regression is one third of that number. The fact that 25 was the best of the values tried suggests that choosing among all 32 predictors at every split might have done even better. This would defeat one main purpose of random forests, which is to avoid fitting the noise in the data by forcing the trees to use different variables at more splits. One interpretation of the high-mtry fit is that the data are not very noisy; but since we are still only getting an \(R^2\) of about .68 here, under that low-noise interpretation we must be missing a lot of predictive information, or at least be biased/underfit.
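
The mtry trend described above can be inspected directly from the saved train object; the call below is illustrative and plots the cross-validated performance for each mtry value tried.

# Illustrative check: cross-validated performance profile across the mtry grid
plot(savedRfTune, main = "Random Forest mtry tuning profile")
# savedRfTune$results  # the underlying table of metrics per mtry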

postResample(predict(savedRfTune, test_imputeKNN2), test_imputeKNN2$ph)
     RMSE  Rsquared       MAE 
0.5157108 0.7328671 0.3825878 

3.4.3 Cubist

Finally, we tested a Cubist model. It is simpler in construction than the other tree-based models, yet does an excellent job, achieving a nearly identical score to the random forest model while taking a fraction of the time to compute.

set.seed(1289)
cubistMo <- train(ph ~ ., data = train_imputeKNN2,
                     method = 'cubist',
                       trControl = tuning)
postResample(predict(cubistMo, test_imputeKNN2), test_imputeKNN2$ph)
     RMSE  Rsquared       MAE 
0.5010956 0.7390457 0.3758376 

4 Comparing Models

In this section we compare the best models from our three modeling categories and analyze the most important variables for each model.

For the linear models we found that Ridge Regression performed best. The best performing non-linear model was the neural network. Finally, our best performing Tree model was our Random Forest model.

The Neural Network and Ridge Regression models share most of their top variables, with oxygen_filler being the most important variable for both. While it was not the top variable for the Random Forest model, it was still important there. Likewise, the top Random Forest variable, mnf_flow, was also important for the other two models. Interestingly, brand_code was a very important variable for the Random Forest but not for the other models; the difference in how the missing brand data was handled likely caused this. It is also notable that the least important variable for the Random Forest, pressure_setpoint, is fairly important to both of the other models, while pc_volume was unimportant to both the Ridge Regression and Neural Network models but has value for the Random Forest.
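
To make this comparison easier to scan, the scaled importances from caret's varImp can be pulled into one table; the snippet below is a convenience sketch layered on the same varImp calls used for the plots that follow, not part of the model fitting.

# Illustrative comparison: scaled variable importances from the three final models, side by side
imp_ridge <- varImp(ridge_model)$importance %>% rownames_to_column("variable") %>% rename(ridge = Overall)
imp_nnet  <- varImp(nnetTune)$importance    %>% rownames_to_column("variable") %>% rename(nnet = Overall)
imp_rf    <- varImp(savedRfTune)$importance %>% rownames_to_column("variable") %>% rename(rf = Overall)
imp_ridge %>%
  full_join(imp_nnet, by = "variable") %>%
  full_join(imp_rf, by = "variable") %>%
  arrange(desc(rf)) %>%
  head(10)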

plot(varImp(ridge_model),main="Ridge Regression Model Important Variables")

plot(varImp(nnetTune),main="Neural Network Important Variables")

plot(varImp(savedRfTune),main="Random Forest Important Variables")

We also produce predictions with each of our top performing models. The code used to create our predictions can be viewed below.

tree_eval <- select(tree_eval,!ph)
tree_eval_alt <- select(tree_eval_alt,!ph)
 ridge <- tree_eval
nnet <- tree_eval
rf <- tree_eval_alt
  predicts <- map2(list(ridge,nnet,rf),list(ridge_model,nnetTune,savedRfTune) ,
                   ~ .x %>%
                        mutate(ph = predict(.y,.x) )
  )

adjusted <- map2(predicts,list(impute,impute,impute2) ,
  ~.x %>%
     mutate(ph = (ph * .y$std['ph']) + .y$mean['ph'] ) %>%
    relocate(ph) %>%
    as.data.frame()
  
)



map2(adjusted,c("Ridge Model","Neural Net Model","Random Forest Model") ,
     
     ~write.xlsx(.x,file='Data/Evaluation_Results.xlsx',sheetName = .y,append = T,row.names = FALSE)
     
     )

5 Conclusion

For this assignment our group was tasked with understanding the manufacturing process at ABC Beverages. We broke down the predictive factors in the historical data we were given, examining the statistical properties of each. We investigated the nature of the missing values, allowing us to devise a proper strategy for handling them, and we identified and dealt with potential collinearity issues. Using all of this information, we created cleaned training and test sets that could be used for modeling. We then searched for the best model within the linear, non-linear, and tree-based families. After selecting the best performing model in each category, we analyzed the most important variables for each. We hope this information will help the company better understand which factors matter most in predicting the PH of a beverage. We will also include a report that goes into more detail on each of the top predictors we discovered. Thank you for taking the time to read our report.