pacman::p_load(MASS,tidyverse,DataExplorer,skimr,Amelia,naniar,RANN,caret,car,janitor,
missForest,caTools,earth,finalfit,inspectdf,xlsx)
df <- readxl::read_xlsx("Data/StudentData.xlsx")
eval <- readxl::read_xlsx("Data/StudentEvaluation.xlsx")
For this assignment our group was tasked with understanding the manufacturing process at ABC Beverages. Using historical data, we seek to build predictive models that illuminate the key processes in predicting PH. This report is technical in nature; we assume the reader is familiar with the statistical terminology and the various models used here. A separate non-technical report will give a simpler breakdown of our findings and go into more detail on the most important predictors. In this report we explore the structure of the data, focus on building linear, non-linear, and tree-based models, find the best predictive model of each type, and then analyze which variables were most important to those models.
We begin by exploring the data itself. Shown below is the historical data set we will be using. Initial inspection makes clear that there are missing values and that all of our columns are numeric except the single categorical Brand Code column.
df
df$`Brand Code` <- as.factor(df$`Brand Code`)
eval$`Brand Code` <- as.factor(eval$`Brand Code`)
We further inspect the values and discover several key features. To begin with, the majority of our columns are missing at least one value. Fortunately, no column has an excessive amount of missingness; the worst case still has a completeness rate of 0.918. We also note that the most common Brand Code is B, which makes up roughly half the data, making it far more common than any other brand in this data set. Many of the numeric variables are also either heavily skewed or have very little variance.
skim(df)
| Data summary | |
|---|---|
| Name | df |
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: factor | 1 |
| Column type frequency: numeric | 32 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | FALSE | 4 | B: 1239, D: 615, C: 304, A: 293 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
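To put numbers behind the skew and low-variance observation above, a quick check along the following lines could be run. This is a sketch of our own, not part of the original workflow, and it assumes the e1071 package is available for the skewness calculation (it is not in the package list loaded above).
# Rank numeric columns by skewness and list any near-zero-variance predictors
library(e1071)
df %>%
  dplyr::select(where(is.numeric)) %>%
  purrr::map_dbl(~ e1071::skewness(.x, na.rm = TRUE)) %>%
  sort(decreasing = TRUE) %>%
  head()
caret::nearZeroVar(df, names = TRUE)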
Looking at the variables with the most missing values, there does not appear to be a clear pattern to the missingness. We note that the data is most likely missing at random. We also run a statistical test (Little's MCAR test) to be certain that the data is not missing completely at random. Given the high test statistic and low p-value, we conclude the data is not missing completely at random.
As an additional note, we also chose to remove the 4 rows with missing PH values. Since our objective is to predict PH itself, it did not make sense to impute those values, and with so few missing we decided their removal would not change the results in any meaningful way.
df %>%
missing_pattern('PH',c('MFR','Brand Code','Carb Pressure1','PSC CO2','Filler Speed','PC Volume'))
PH Carb Pressure1 PC Volume PSC CO2 Filler Speed Brand Code MFR
2160 1 1 1 1 1 1 1 0
138 1 1 1 1 1 1 0 1
100 1 1 1 1 1 0 1 1
10 1 1 1 1 1 0 0 2
3 1 1 1 1 0 1 1 1
47 1 1 1 1 0 1 0 2
3 1 1 1 1 0 0 0 3
31 1 1 1 0 1 1 1 1
5 1 1 1 0 1 1 0 2
2 1 1 1 0 1 0 1 2
26 1 1 0 1 1 1 1 1
4 1 1 0 1 1 1 0 2
4 1 1 0 1 1 0 1 2
1 1 1 0 1 0 1 0 3
1 1 1 0 0 1 1 1 2
28 1 0 1 1 1 1 1 1
1 1 0 1 1 1 0 1 2
3 1 0 0 1 1 1 1 2
1 0 1 1 1 1 1 0 2
3 0 1 1 1 0 1 0 3
4 32 39 39 57 120 212 503
df <- df[!is.na(df$PH),]
mcar_test(df)
Looking at our values by brand, we do find that some variables differ in distribution when broken up by brand. The missing values are not clearly concentrated in one brand or another. However, given how skewed some of the numeric variables are, it is still possible that certain values come mostly from a single brand.
plot_boxplot(df,by="Brand Code")
corr_simple <- function(data=df, sig=0.5){
  library(corrplot)
  # Convert data to numeric in order to run correlations.
  # Convert to factor first to keep the integrity of the data - each value will
  # become a number rather than turn into NA.
  df_cor <- data %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  # Run a correlation and drop the insignificant ones.
  corr <- cor(df_cor)
  # Prepare to drop duplicates and correlations of 1.
  corr[lower.tri(corr, diag=TRUE)] <- NA
  # Drop perfect correlations.
  corr[corr == 1] <- NA
  # Turn into a 3-column table.
  corr <- as.data.frame(as.table(corr))
  # Remove the NA values from above.
  corr <- na.omit(corr)
  # Select significant values.
  corr <- subset(corr, abs(Freq) > sig)
  # Sort by highest correlation.
  corr <- corr[order(-abs(corr$Freq)),]
  # Print table.
  print(corr)
  # Turn corr back into a matrix in order to plot with corrplot.
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  # Plot correlations visually.
  corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ", method=c('number'))
}
Next, we look at the most highly correlated variable pairs in our data. Given how many cross-correlations there are within this data set, we decided it was prudent to examine only the strongest ones. Doing so shows that many of the predictors are highly correlated with one another.
corr_simple(na.omit(df))
Var1 Var2 Freq
1080 Balling Balling Lvl 0.9884121
908 Filler Level Bowl Setpoint 0.9772417
1078 Density Balling Lvl 0.9564173
781 Density Balling 0.9531276
744 Filler Speed MFR 0.9512644
1087 Alch Rel Balling Lvl 0.9473433
1014 Balling Alch Rel 0.9457260
1012 Density Alch Rel 0.9229932
476 Hyd Pressure2 Hyd Pressure3 0.9176619
1054 Alch Rel Carb Rel 0.8827915
1088 Carb Rel Balling Lvl 0.8721483
1045 Density Carb Rel 0.8603222
1047 Balling Carb Rel 0.8589906
1025 Carb Volume Carb Rel 0.8410722
992 Carb Volume Alch Rel 0.8247876
170 Carb Pressure Carb Temp 0.8243350
761 Carb Volume Balling 0.8187788
1058 Carb Volume Balling Lvl 0.8179933
936 Fill Pressure Pressure Setpoint 0.8152856
695 Carb Volume Density 0.8018088
472 Mnf Flow Hyd Pressure3 0.7641428
442 Hyd Pressure1 Hyd Pressure2 0.7118513
1006 Hyd Pressure4 Alch Rel -0.6855760
807 Hyd Pressure3 Pressure Vacuum -0.6722583
991 Brand Code Alch Rel 0.6666969
439 Mnf Flow Hyd Pressure2 0.6483379
806 Hyd Pressure2 Pressure Vacuum -0.6281675
496 Brand Code Hyd Pressure4 -0.6248840
901 Mnf Flow Bowl Setpoint -0.6115804
538 Mnf Flow Filler Level -0.6083807
475 Hyd Pressure1 Hyd Pressure3 0.6055890
775 Hyd Pressure4 Balling -0.5949966
709 Hyd Pressure4 Density -0.5843820
1039 Hyd Pressure4 Carb Rel -0.5818275
1072 Hyd Pressure4 Balling Lvl -0.5750443
497 Carb Volume Hyd Pressure4 -0.5679086
637 Mnf Flow Usage cont 0.5601926
802 Mnf Flow Pressure Vacuum -0.5599433
868 Mnf Flow Oxygen Filler -0.5500664
1024 Brand Code Carb Rel 0.5375376
34 Brand Code Carb Volume 0.5337574
906 Hyd Pressure3 Bowl Setpoint -0.5292159
543 Hyd Pressure3 Filler Level -0.5279659
760 Brand Code Balling 0.5007730
We calculate the variance inflation and generalized variance inflation factors. Doing so shows that there are possible collinearity issues among our predictors. This is not necessarily an issue for every model type, but we make careful note of it regardless.
vif(lm(PH~.,df,na.action = na.exclude))
GVIF Df GVIF^(1/(2*Df))
`Brand Code` 192.548930 3 2.403017
`Carb Volume` 18.044215 1 4.247848
`Fill Ounces` 1.190455 1 1.091080
`PC Volume` 1.653163 1 1.285754
`Carb Pressure` 42.681219 1 6.533087
`Carb Temp` 35.506521 1 5.958735
PSC 1.164174 1 1.078969
`PSC Fill` 1.112638 1 1.054817
`PSC CO2` 1.069956 1 1.034387
`Mnf Flow` 5.077635 1 2.253361
`Carb Pressure1` 1.485310 1 1.218733
`Fill Pressure` 3.804648 1 1.950551
`Hyd Pressure1` 3.065348 1 1.750813
`Hyd Pressure2` 10.892929 1 3.300444
`Hyd Pressure3` 12.954982 1 3.599303
`Hyd Pressure4` 3.122824 1 1.767151
`Filler Level` 25.175483 1 5.017518
`Filler Speed` 11.191642 1 3.345391
Temperature 1.343282 1 1.159000
`Usage cont` 1.768304 1 1.329776
`Carb Flow` 2.348573 1 1.532506
Density 17.059399 1 4.130303
MFR 11.479327 1 3.388116
Balling 175.704876 1 13.255372
`Pressure Vacuum` 4.291585 1 2.071614
`Oxygen Filler` 1.567662 1 1.252063
`Bowl Setpoint` 25.827609 1 5.082087
`Pressure Setpoint` 3.597044 1 1.896587
`Air Pressurer` 1.242666 1 1.114749
`Alch Rel` 52.309517 1 7.232532
`Carb Rel` 7.657658 1 2.767247
`Balling Lvl` 210.175794 1 14.497441
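As a quick follow-up, predictors exceeding a common rule-of-thumb cutoff could be flagged directly from the table above. The sqrt(5) threshold below (roughly equivalent to VIF > 5) is our own choice for illustration, not something prescribed in this report.
# Flag predictors whose adjusted GVIF exceeds sqrt(5) (~2.24), i.e. roughly VIF > 5
v <- vif(lm(PH ~ ., df, na.action = na.exclude))
rownames(v)[v[, "GVIF^(1/(2*Df))"] > sqrt(5)]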
The modeling process begins with cleaning the names of our predictor columns to make modeling easier. We then create a train and test split. We do so prior to imputing so that we retain a hopefully unbiased set of test data for verifying how our models perform.
df_clean <- clean_names(df)
eval_clean <- clean_names(eval)
set.seed(125)
sample <- sample.split(df_clean$ph, SplitRatio = 0.8)
train <- subset(df_clean, sample == TRUE)
test <- subset(df_clean, sample == FALSE)
We created two main versions of the data to build and test our models. The first training set has missing numeric values filled in using KNN imputation, and we use missForest to fill in the missing categorical values. The second set uses the same imputation process but adds the corr method to remove any columns that are overly correlated with one another.
set.seed(122)
impute <- preProcess(as.data.frame(train), method = "knnImpute")
train_imputeKNN <- predict(impute, train)
train_tree_KNN_impute <- missForest(as.data.frame(train_imputeKNN))
train_tree_KNN <- train_tree_KNN_impute$ximp
test_imputeKNN <- predict(impute, test)
test_tree_KNN_impute <- missForest(as.data.frame(test_imputeKNN))
test_tree_KNN <- test_tree_KNN_impute$ximp
impute_eval <- predict(impute, eval_clean)
tree_eval <- missForest(as.data.frame(impute_eval))$ximp
set.seed(122)
impute_c <- preProcess(as.data.frame(train), method = c("knnImpute","corr"))
train_imputeKNN_c <- predict(impute_c, as.data.frame(train))
train_tree_KNN_impute_c <- missForest(as.data.frame(train_imputeKNN_c))
train_tree_KNN_c <- train_tree_KNN_impute_c$ximp
test_imputeKNN_c <- predict(impute_c, as.data.frame(test))
test_tree_KNN_impute_c <- missForest(as.data.frame(test_imputeKNN_c))
test_tree_KNN_c <- test_tree_KNN_impute_c$ximp
A third imputed data set was created as well. It uses KNN imputation for numeric values, but we fill in the missing Brand Code values with a new level, N. We found that this data worked best for the tree models, so it was used for those.
# Keep same train/test split as above models, but preprocess slightly differently.
set.seed(122)
# Copy the d.f's so as not to mess with future predictions for previous models
train2 = train
test2 = test
eval2 = eval_clean
# Convert NA brand_code to a new brand "N" instead of KNN-imputing it.
levels(train2$brand_code) = c(levels(train2$brand_code), "N")
levels(test2$brand_code) = c(levels(test2$brand_code), "N")
levels(eval2$brand_code) = c(levels(eval2$brand_code), "N")
train2$brand_code[is.na(train2$brand_code)] = "N"
test2$brand_code[is.na(test2$brand_code)] = "N"
eval2$brand_code[is.na(eval2$brand_code)] = "N"
# Now use same imputation on remaining NA's
impute2 <- preProcess(as.data.frame(train2), method = "knnImpute")
train_imputeKNN2 <- predict(impute2, train2)
# Note: using training imputer to impute test NA's
test_imputeKNN2 <- predict(impute2, test2)
tree_eval_alt <- predict(impute2,eval2)
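The linear models that follow are trained with a caret trainControl object named ctrl, which is defined elsewhere in our workflow. A minimal sketch of such an object is shown below; the 10-fold cross-validation settings are an assumption for illustration, not the exact configuration we used.
# Hypothetical resampling control for the caret models below (settings assumed)
ctrl <- trainControl(method = "cv", number = 10)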
lm_cols <- stepAIC(lm(ph ~ ., train_tree_KNN), trace=F)
set.seed(128)
lm_model <- train(train_tree_KNN[,colnames(lm_cols$model)[colnames(lm_cols$model) != "ph"]],
train_tree_KNN$ph,
method='lm',metric='Rsquared', trControl = ctrl)
resample_lm1 <- postResample(predict(lm_model$finalModel, test_tree_KNN), test_tree_KNN$ph)
We start with a linear model, fitting a backward step-wise regression on the tree_KNN pre-processed data. The model gives us the following metrics: RMSE - 0.765, R-squared - 0.419, and MAE - 0.595. Below, we check the residual plots to make sure that there are no patterns:
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm_model$finalModel)
The test data for this model gives us the following metrics: RMSE - 0.774, R-squared - 0.378, and MAE - 0.602. We see that Linear Model 1 slightly overfits the PH data, since it does not predict the PH values in the test data as well as in the training data. Next, we attempt a second model using the train_tree_KNN_c data. Again we use a step-wise regression model, which gives us the following metrics:
lm2_col <- stepAIC(lm(ph~.,train_tree_KNN_c),trace=F)
set.seed(120)
lm_model2 <- train(train_tree_KNN_c[,colnames(lm2_col$model)[colnames(lm2_col$model) != "ph"]],
train_tree_KNN_c$ph,
method='lm',metric='Rsquared', trControl = ctrl)
resample_lm2 <- postResample(predict(lm_model2$finalModel,test_tree_KNN_c), test_tree_KNN_c$ph)
RMSE - 0.775, R-squared - 0.403, and MAE - 0.606. Below, we check the residual plots to make sure that there are no patterns:
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(lm_model2$finalModel)
The test data for this model gives us the following metrics: RMSE - 0.778, R-squared - 0.371, and MAE - 0.605. We see that Linear Model 2 also slightly overfits the PH data, since it does not predict the PH values in the test data as well as in the training data. Given that Linear Model 1 performs better on both the training and test data, there does not seem to be a need to pre-process the data by removing the highly correlated variables. The residual plot for Linear Model 2 looked slightly better, but not enough to warrant discarding Linear Model 1.
We attempt to get a higher R-squared by using different linear regression models on both data sets. The first model type we attempt is Partial Least Squares (PLS). Below we see the results of training the model across 1 to 15 components.
set.seed(120)
PLS_model <- train(ph~.,
data=train_tree_KNN,
method='pls',metric='Rsquared',
tuneLength = 15, trControl = ctrl)
PLS_model1 <- train(ph~.,
data=train_tree_KNN_c,
method='pls',metric='Rsquared',
tuneLength = 15, trControl = ctrl)
plot(PLS_model,main=latex2exp::TeX("PLS model 1 $R^2$"))
plot(PLS_model1,main=latex2exp::TeX("PLS model 2 $R^2$"))
resample_PLS1 <- postResample(predict(PLS_model,test_tree_KNN),test_tree_KNN$ph)
resample_PLS2 <- postResample(predict(PLS_model1,test_tree_KNN_c),test_tree_KNN_c$ph)
PLS Model 1 achieves its highest R-squared of 0.4138 using 14 components. PLS Model 2, on the other hand, achieves its highest R-squared of 0.3994, also with 14 components.
The test-set R-squared for PLS Model 1 was 0.3813, while PLS Model 2 had a test-set R-squared of 0.3811. Given that PLS Model 1 performed better on the test data than both the OLS models and PLS Model 2, we will consider it the best linear model so far. Next we will try a ridge regression.
Given that there seem to be many highly correlated predictors, we decided to try a penalized regression model as a way to mitigate these correlations. Below we see the plots that show which lambda produces the highest R-squared.
set.seed(120)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 10))
ridge_model <- train(ph~.,data=train_tree_KNN,
method = "ridge",metric='Rsquared',
tuneGrid = ridgeGrid,trControl = ctrl)
ridge_model1 <- train(ph~.,data=train_tree_KNN_c,
method = "ridge",metric='Rsquared',
tuneGrid = ridgeGrid,trControl = ctrl)
plot(ridge_model,main=latex2exp::TeX("Ridge model 1 $R^2$"))
plot(ridge_model1,main=latex2exp::TeX("Ridge model 2 $R^2$"))
resample_Ridge1 <- postResample(predict(ridge_model,test_tree_KNN),test_tree_KNN$ph)
resample_Ridge2 <- postResample(predict(ridge_model1,test_tree_KNN_c),test_tree_KNN_c$ph)
For both Ridge Model 1 and Ridge Model 2, a lambda of 0.011 gives the best R-squared, 0.4133 and 0.3975 respectively. The test-set R-squared for Ridge Regression 1 was 0.3838, while for Ridge Regression 2 it was 0.3825.
Given that the test-set R-squared for Ridge Regression 1 is higher than that of PLS Model 1, and therefore higher than any of the OLS regressions, we select Ridge Regression 1 as our best linear regression model.
tuning = trainControl(
method = "cv", number = 5)
For our non-linear modeling, we began with the K-nearest neighbors algorithm. We see the results of the model below; it appears to perform better than the linear models tested in the previous section.
set.seed(124)
knnModel <- train( ph~.,
data = train_tree_KNN,
method = "knn",
trControl = tuning,
tuneLength = 10)
knnModel2 <- train( ph~.,
data = train_tree_KNN_c,
method = "knn",
trControl = tuning,
tuneLength = 10)
postResample(predict(knnModel, test_tree_KNN), test_tree_KNN$ph)
RMSE Rsquared MAE
0.6791006 0.5250011 0.5177956
postResample(predict(knnModel2, test_tree_KNN_c), test_tree_KNN_c$ph)
RMSE Rsquared MAE
0.6811649 0.5241601 0.5180207
For the next non-linear model, we tested a neural network. We found the parameters below to be optimal for training the network. This model performed much better than KNN. It took much longer to train, however, so we saved the fitted model to disk and reload it when needed.
cutoff <- findCorrelation(cor(select(train_tree_KNN,!brand_code)), cutoff = .75)
train_bevnet <- train_tree_KNN[, -cutoff]
cutoff2 <- findCorrelation(cor(select(train_tree_KNN_c,!brand_code)), cutoff = .75)
train_bevnet2 <- train_tree_KNN_c[, -cutoff2]
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
## The next option is to use bagging (see the
## next chapter) instead of different random
## seeds.
.bag = FALSE)
set.seed(100)
nnetTune <- train(ph~. ,
data = train_tree_KNN,
method = "avNNet",
tuneGrid = nnetGrid,
trControl =tuning,
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_bevnet) + 1)+10 +1,
maxit = 1000,
allowParallel =T
)
set.seed(100)
nnetTune2 <- train(ph~. ,
data = train_tree_KNN_c,
method = "avNNet",
tuneGrid = nnetGrid,
trControl =tuning,
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(train_bevnet2) + 1)+10 +1,
maxit = 1000,
allowParallel =T
)
saveRDS(nnetTune,"Models/nnetTune.rds")
saveRDS(nnetTune2,"Models/nnetTune2.rds")
nnetTune <- readRDS("Models/nnetTune.rds")
nnetTune2 <- readRDS("Models/nnetTune2.rds")
postResample(predict(nnetTune, test_tree_KNN), test_tree_KNN$ph)
RMSE Rsquared MAE
0.6123699 0.6099018 0.4637191
postResample(predict(nnetTune2, test_tree_KNN_c), test_tree_KNN_c$ph)
RMSE Rsquared MAE
0.6220944 0.5978055 0.4759819
Next we tested a MARS model. It does not appear to have handled this data very well; it did much worse than the previous two models. However, removing the highly correlated predictors helped this model greatly.
set.seed(122)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(ph~. ,
data = train_tree_KNN,
method = "earth",
tuneGrid = marsGrid,
trControl = tuning)
marsTuned2 <- train(ph~. ,
data = train_tree_KNN_c,
method = "earth",
tuneGrid = marsGrid,
trControl = tuning)
saveRDS(marsTuned,"Models/marsTuned.rds")
saveRDS(marsTuned2,"Models/marsTuned2.rds")
marsTuned <- readRDS("Models/marsTuned.rds")
marsTuned2 <- readRDS("Models/marsTuned2.rds")
postResample(predict(marsTuned, test_tree_KNN), test_tree_KNN$ph)
RMSE Rsquared MAE
0.7506974 0.4146177 0.5815867
postResample(predict(marsTuned2, test_tree_KNN_c), test_tree_KNN_c$ph)
RMSE Rsquared MAE
0.6871890 0.5145190 0.5371954
Finally, we tested a support vector machine. It also appears to handle the data very well, scoring second best among our non-linear models.
svmRTuned <- train(ph~. ,
data = train_tree_KNN,
method = "svmRadial",
tuneLength = 10,
trControl = tuning)
svmRTuned2 <- train(ph~. ,
data = train_tree_KNN_c,
method = "svmRadial",
tuneLength = 10,
trControl = tuning)
saveRDS(svmRTuned,"Models/svmRTuned.rds")
saveRDS(svmRTuned2,"Models/svmRTuned2.rds")
svmRTuned <- readRDS("Models/svmRTuned.rds")
svmRTuned2 <- readRDS("Models/svmRTuned2.rds")
postResample(predict(svmRTuned, test_tree_KNN), test_tree_KNN$ph)
RMSE Rsquared MAE
0.6453452 0.5699993 0.4746100
postResample(predict(svmRTuned2, test_tree_KNN_c), test_tree_KNN_c$ph)
RMSE Rsquared MAE
0.6444957 0.5682748 0.4720243
Moving on to tree-based models, we began with a gradient boosted machine (GBM). Several of its hyperparameters were tuned, and we spent some time discovering the optimal values for this model.
gbmGrid = expand.grid(.interaction.depth = c(8, 12, 16, 20),
.n.trees = c(1500, 2000, 3000),
.shrinkage = c(.005, .01, .02, .03, .04),
.n.minobsinnode = c(4, 5, 6, 7, 9))
set.seed(624)
gbmCV = train(ph~. ,
data = train_imputeKNN2,
method = 'gbm',
tuneGrid = gbmGrid,
distribution = 'gaussian',
trControl = tuning,
verbose = F)
saveRDS(gbmCV, "Models/gbmCVb.rds")
savedGbmC <- readRDS("Models/gbmCVb.rds")
This final GBM model only used 2000 trees, which produced better results than 1500 or 3000 trees. The interaction depth was limited to 16 and 20; 8 and 12 had been tried previously with less success. Six was the optimal minimum number of observations in a node, and in earlier fits the results improved as shrinkage increased, up to .03 but not beyond.
postResample(predict(savedGbmC, as.data.frame(test_imputeKNN2)), test_imputeKNN2$ph)
RMSE Rsquared MAE
0.5423767 0.6939894 0.4030513
The \(R^2\) is better on the validation set than on the averages of the CV-folds, suggesting that it was a somewhat “lucky” split.
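Both points, the chosen hyperparameters and the CV-versus-holdout gap, can be checked directly from the saved caret object; the snippet below is a sketch and assumes savedGbmC retained its tuning results.
savedGbmC$bestTune                             # hyperparameter combination selected by caret
max(savedGbmC$results$Rsquared, na.rm = TRUE)  # best cross-validated R-squared, to compare with the holdout value above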
rfGrid = expand.grid(.mtry = c(15, 20, 25))
set.seed(624)
# trainXb / trainYb: predictor matrix and ph response for the random forest fit
# (assumed to be split out from the train_imputeKNN2 data prepared above)
rfTune = train(trainXb, trainYb, method = 'rf', ntree = 2000,
       tuneGrid = rfGrid, trControl = tuning)
saveRDS(rfTune, "Models/rfTune.rds")
savedRfTune <- readRDS("Models/rfTune.rds")
It’s interesting that the mtry hyperparameter kept improving the results as it got higher. Remember, there are only 32 predictors here, and the default suggestion for rf models is 1/3 of that number. The fact that 25 was the best of the values tried shows that it’s possible that choosing between all 32 predictors at every split might have done even better. This would defeat one main purpose of random forests, which is to avoid fitting a model to the noise in the data by forcing the trees to use different variables at more splits. One interpretation of the high mtry fit is that the data aren’t very noisy, although we are still only getting .68 \(R^2\) here, so under that low-noise interpretation, we must be missing a lot of predictive information, or at least be biased/underfit.
postResample(predict(savedRfTune, test_imputeKNN2), test_imputeKNN2$ph)
RMSE Rsquared MAE
0.5157108 0.7328671 0.3825878
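To make the mtry discussion above concrete, the usual regression default and a grid that reaches the bagging limit could be written as below; rfGrid_full is a hypothetical extension we did not actually run.
# Default mtry for regression random forests is floor(p / 3),
# assuming ph is the only non-predictor column in train_imputeKNN2
p <- ncol(train_imputeKNN2) - 1
floor(p / 3)                                      # roughly 10 with 32 predictors
rfGrid_full <- expand.grid(.mtry = c(25, 28, p))  # hypothetical grid up to mtry = p (bagging)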
Finally, we tested a Cubist model. It is very simple in construction compared to the other tree models, but manages to do an excellent job. It achieves a nearly identical score to the previous random forest model while taking a fraction of the time to compute.
set.seed(1289)
cubistMo <- train(ph ~ ., data = train_imputeKNN2,
method = 'cubist',
trControl = tuning)
postResample(predict(cubistMo, test_imputeKNN2), test_imputeKNN2$ph)
RMSE Rsquared MAE
0.5010956 0.7390457 0.3758376
In this section we compare the best models from our three modeling categories and analyze the most important variables for each model.
For the linear models we found that Ridge Regression performed best. The best performing non-linear model was the neural network. Finally, our best performing Tree model was our Random Forest model.
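For convenience, the holdout metrics for the three finalists can be stacked into one table. This is our own summary that simply restates the postResample results reported above; note that the random forest is evaluated on the alternate (brand code N) preprocessing of the same test split.
# Holdout metrics for the best model in each category
rbind(
  Ridge        = postResample(predict(ridge_model, test_tree_KNN), test_tree_KNN$ph),
  NeuralNet    = postResample(predict(nnetTune, test_tree_KNN), test_tree_KNN$ph),
  RandomForest = postResample(predict(savedRfTune, test_imputeKNN2), test_imputeKNN2$ph)
)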
It would appear that the Neural Network and Ridge Regression models share most of their top variables, with oxygen_filler being the most important variable for both. While it was not the most important variable for the Random Forest model, it was still important there. Likewise, the top Random Forest variable, mnf_flow, was also important to both of the other models. Interestingly, brand_code was a very important variable for the Random Forest but not for the other models; the different handling of its missing values likely caused this. Also of note, pressure_setpoint, the least important variable for the Random Forest, is fairly important to both of the other models, while pc_volume was unimportant to both the Ridge Regression and Neural Network models but has value to the Random Forest.
plot(varImp(ridge_model),main="Ridge Regression Model Important Variables")
plot(varImp(nnetTune),main="Neural Network Important Variables")
plot(varImp(savedRfTune),main="Random Forest Important Variables")
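For a side-by-side view of the plots above, the scaled importance scores can also be pulled into tables; top_vars below is a small hypothetical helper of our own, not part of the original code.
# Extract the top-n variables (caret scales importance 0-100) from a fitted train object
top_vars <- function(model, n = 10) {
  imp <- caret::varImp(model)$importance
  imp[order(-imp[, 1]), , drop = FALSE][seq_len(n), , drop = FALSE]
}
lapply(list(Ridge = ridge_model, NeuralNet = nnetTune, RandomForest = savedRfTune), top_vars)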
We also produce predictions with each of our top performing models. The code used to create our predictions can be viewed below.
tree_eval <- select(tree_eval,!ph)
tree_eval_alt <- select(tree_eval_alt,!ph)
ridge <- tree_eval
nnet <- tree_eval
rf <- tree_eval_alt
predicts <- map2(list(ridge,nnet,rf),list(ridge_model,nnetTune,savedRfTune) ,
~ .x %>%
mutate(ph = predict(.y,.x) )
)
adjusted <- map2(predicts,list(impute,impute,impute2) ,
~.x %>%
mutate(ph = (ph * .y$std['ph']) + .y$mean['ph'] ) %>%
relocate(ph) %>%
as.data.frame()
)
map2(adjusted,c("Ridge Model","Neural Net Model","Random Forest Model") ,
~write.xlsx(.x,file='Data/Evaluation_Results.xlsx',sheetName = .y,append = T,row.names = FALSE)
)
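As a final sanity check on the back-transformed predictions (our own addition, not part of the submitted workbook), the predicted pH values can be summarized to confirm they fall within the range seen in the historical data (roughly 7.9 to 9.4):
# Summarize predicted pH for each of the three models
lapply(setNames(adjusted, c("Ridge", "NeuralNet", "RandomForest")), function(x) summary(x$ph))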
For this assignment our group was tasked with understanding the beverage manufacturing process at ABC Beverages. We broke down the predictive factors within the historical data we were given, finding the statistical properties of each. We investigated the nature of the missing values, allowing us to devise a proper strategy for handling them, and we identified and dealt with potential collinearity issues. Using all of this information, we created cleaned training and test sets for modeling. We then found the best model within each of the linear, non-linear, and tree-based categories, and after deciding on the best performing model for each category, we analyzed its most important variables. We hope this information allows the company to better understand which factors are important in predicting the PH of a beverage. We will also include a report that goes into more detail on each of the top important variables we discovered. Thank you very much for taking the time to read our report.