This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
# Packages used for data import and exploration
library(readxl); library(tidyverse); library(skimr); library(DataExplorer)
training <- read_xlsx("StudentData.xlsx")
testing <- read_xlsx("StudentEvaluation.xlsx") %>%
  mutate(PH = as.numeric(PH))  # PH is the response; it is blank in the evaluation file
An initial glance at the data shows that the training set has dimensions of 2571 x 33 (rows x columns). Most of the features are numeric; only one, Brand Code, is categorical.
skim(training)
Name | training |
Number of rows | 2571 |
Number of columns | 33 |
Column type frequency: | |
character | 1 |
numeric | 32 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
The figure below shows the distribution of the brand codes in the dataset, where we observe a strong skew toward brand B as well as a number of observations with an unknown (missing) brand code.
plot_bar(training)
For the numeric data, we can plot the histograms below with the DataExplorer package and assess the distributions of all of our features, including pH. The distributions differ considerably. pH follows an approximately Gaussian distribution, which is beneficial for prediction. Many features show mild skewness or bimodal shapes; we will address these via Box-Cox transformations (a brief illustration follows the histograms).
plot_histogram(training,ncol = 3)
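As a brief illustration of what the Box-Cox step (applied inside the preprocessing recipe further below) does to a single skewed feature, the sketch below estimates a transformation parameter for Temperature with the forecast package. The choice of Temperature here is purely illustrative; the actual transformations are handled in the recipe.
# Illustration only: estimate and apply a Box-Cox transformation to one
# right-skewed, strictly positive feature (Temperature). The preprocessing
# recipe below performs the same operation across several features.
library(forecast)
temp <- na.omit(training$Temperature)
lambda <- BoxCox.lambda(temp)    # data-driven estimate of the transformation parameter
temp_bc <- BoxCox(temp, lambda)  # transformed values, closer to a Gaussian shape
hist(temp_bc, main = "Temperature after Box-Cox transformation", xlab = "Transformed value")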
Additionally, we want to examine multicollinearity within the dataset. Because the number of columns is large, we only produce a scatter matrix for the features with the highest Pearson (linear) correlations. From these relationships you can see some strong associations, such as Filler Speed vs. MFR and Balling vs. Balling Lvl; these features will need to be assessed and potentially aggregated or removed to mitigate redundancy in the dataset.
trainingnumeric<-training %>%
select(-`Brand Code`)
training_cors <-
bind_rows(
#pearson correlation of numeric features
trainingnumeric %>%
cor(method = "pearson", use = "pairwise.complete.obs") %>% as.data.frame() %>%
rownames_to_column(var = "x") %>%
pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>%
mutate(cor_type = "pearson"),
#spearman (monotonic) correlations of numeric features
trainingnumeric %>%
cor(method = "spearman", use = "pairwise.complete.obs") %>% as.data.frame() %>%
rownames_to_column(var = "x") %>%
pivot_longer(cols = -x, names_to = "y", values_to = "correlation") %>%
mutate(cor_type = "spearman")
)
topcors <- training_cors %>%
  filter(x != y) %>%
  filter(cor_type == "pearson") %>%
  arrange(-abs(correlation)) %>%
  head(10)
topcorvars<-c(pull(topcors,x),pull(topcors,y)) %>% unique
topcorset<-training %>%
select(all_of(c(topcorvars,'Brand Code')))
GGally::ggpairs(topcorset, progress = F, aes(color = `Brand Code`), title = "Top Correlations Within Training Set")+ theme_bw()
Since pH is our main variable of interest, it is a good idea to examine strong relationships between the predictor variables and pH; this time we look at the top Spearman correlations with pH only. Spearman correlation measures the strength of a monotonic relationship, so it can capture both linear and non-linear associations. Based on the figure below, most of these relationships appear fairly noisy, so a predictive machine learning model may add value in identifying the underlying relationships.
topcorsPH <- training_cors %>%
  filter(x != y) %>%
  filter(cor_type == "spearman") %>%
  arrange(-abs(correlation)) %>%
  filter(x == "PH" | y == "PH") %>%
  head(10)
topcorvarsPH<-c(pull(topcorsPH,x),pull(topcorsPH,y)) %>% unique
topcorsetPH<-training %>%
select(all_of(c(topcorvarsPH,'Brand Code')))
GGally::ggpairs(topcorsetPH, progress = F, aes(color = `Brand Code`), title = "Top Correlations to pH Within Training Set")+ theme_bw()
# Preprocessing recipe; uses the recipes package plus forecast::BoxCox()
library(recipes); library(forecast)
datatrans <- training %>% recipe(~.) %>%
  # set physically impossible negative readings to missing
  step_mutate_at(contains(c("hyd pressure", "mnf flow", "filler speed", "carb flow", "balling")), fn = ~ifelse(. < 0, NA, .)) %>%
  # Box-Cox transform skewed features toward a Gaussian shape
  step_mutate_at(`Carb Volume`, `Filler Level`, PSC, `Air Pressurer`, `Bowl Setpoint`, `Oxygen Filler`, Temperature, MFR, fn = ~BoxCox(., lambda = BoxCox.lambda(.))) %>%
  step_mutate_at(`Brand Code`, fn = ~as.factor(.)) %>%
  # impute remaining missing values with k-nearest neighbors
  step_impute_knn(all_predictors()) %>%
  step_rename(BrandCode = `Brand Code`) %>%
  # expand categorical variables into 0/1 indicator columns
  step_dummy(all_nominal()) %>%
  # drop highly correlated (|r| > 0.9) and near-zero-variance predictors
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_nzv(all_predictors()) %>%
  prep()
training_cleaned<-
datatrans %>% bake(training)
For data preparation, the following baseline steps were taken, with the reasoning provided below.
Remove observations with missing pH: Because pH is the response variable in this project, we omit any observations that do not have a pH value (if any).
Removing negative values: Negative readings were set to missing (and later imputed) for features where negative values are not physically meaningful, such as the Hyd Pressure variables, Mnf Flow, Filler Speed, Carb Flow, and Balling.
Box-Cox transformations: Skewed features were transformed with Box-Cox so that their distributions are closer to Gaussian. Together with the centering and scaling applied before modeling (mean 0, standard deviation 1), this places features on comparable scales, which matters for distance-based procedures such as the k-nearest-neighbors imputation below, since it keeps variables with large ranges from dominating the distance calculations.
Imputation of missing data with KNN: The remaining missing values were imputed with k-nearest neighbors (KNN). Alternative methods include median, mean, or bagged-tree imputation, but KNN was judged to provide the best results with minimal additional computational effort.
Dummifying variables: Categorical variables were converted into 0/1 indicator columns. This is particularly important for algorithms such as linear regression that cannot handle categorical inputs directly. For example, a feature with categories 1, 2, and 3 is expanded so that category 3 becomes its own column coded 0/1 for absence/presence (a quick check of this encoding is shown after this list of steps).
Collinearity test: Pairwise collinearity was assessed using Pearson correlation with a threshold of 0.9, and a few features were removed as a result. The features removed due to collinearity are shown below.
datatrans$steps[[7]]$removals
[1] "Hyd Pressure3" "Balling" "Bowl Setpoint" "Alch Rel"
[5] "Balling Lvl"
Near-zero variance: Based on the Data Exploration section, there do not appear to be any clear low-variance features in which the large majority of observations take a single value. This is confirmed statistically with tidymodels, where no features were removed due to near-zero variance, as shown below.
datatrans$steps[[8]]$removals
character(0)
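As a quick check on the imputation and dummy-encoding steps described above (a sketch that relies only on the training_cleaned object produced by the recipe), the baked data should contain no remaining missing values, and Brand Code should now appear as separate 0/1 indicator columns:
# Sanity checks on the baked training data (sketch)
anyNA(training_cleaned)   # expect FALSE after KNN imputation
# Brand Code has been expanded into 0/1 indicator columns by step_dummy()
training_cleaned %>%
  select(starts_with("BrandCode")) %>%
  head()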
A series of regression models was built to predict pH values in the bottling process. Models were built on the training dataset using the caret package with 10-fold cross-validation to determine optimal values of the tuning parameters, and were assessed by quantitative comparison of the accuracy metrics root mean square error (RMSE), mean absolute error (MAE), and R-squared. By tuning these hyperparameters, a balance in the bias-variance trade-off can be reached. In addition to making predictions, the variables that most strongly influence pH are identified so that future decisions regarding bottling-process streamlining can be made effectively.
The full historical dataset contains 2571 observations of 32 predictor variables and the target, pH; a held-out test set of 627 observations was reserved for model evaluation. Values of all variables were centered and scaled to facilitate the model algorithms and allow for direct comparison of variable importance within a given model.
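The objects referenced by the model code that follows (trainX, trainY, testX, testY, and the cross-validation control ctrl) are assumed to have been created along the following lines. This is a sketch only; the split proportion and random seed are illustrative assumptions rather than values taken from the original analysis.
library(caret)
# Sketch of the assumed setup for the objects used by the model code below.
set.seed(2021)
model_data <- training_cleaned %>% filter(!is.na(PH))   # drop rows with missing pH
in_train  <- createDataPartition(model_data$PH, p = 0.75, list = FALSE)
train_set <- model_data[in_train, ]
test_set  <- model_data[-in_train, ]
# Center and scale predictors using statistics estimated on the training split
preproc <- preProcess(select(train_set, -PH), method = c("center", "scale"))
trainX  <- predict(preproc, select(train_set, -PH))
trainY  <- select(train_set, PH)
testX   <- predict(preproc, select(test_set, -PH))
testY   <- test_set$PH
# 10-fold cross-validation used to tune every candidate model
ctrl <- trainControl(method = "cv", number = 10)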
The models considered for this analysis are described in turn below.
A partial least squares (PLS) model was built because of the correlations seen among the predictor variables. The model was tuned over the number of components (ncomp) from 1 to 10, and cross-validation was used to select the value that minimized RMSE, which was ncomp = 7.
# tuning parameter ncomp: the number of PLS components
set.seed(0515)
pls_mod <- train(x=trainX, y=trainY$PH,
method="pls",
tuneLength=10,
trControl=ctrl)
# plot model tuning
pls_mod %>%
ggplot() +
ggtitle("Results of PLS cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
pls_pred <- predict(pls_mod, newdata = testX) %>%
as.data.frame()
# get accuracy metrics of test set data
pls_acc <- postResample(pred=pls_pred, obs=testY)
K-nearest neighbors (KNN) regression predicts a new sample using the K closest samples from the training set. For the model built here, Euclidean distance was used, and the optimal value of K, the number of neighbors used in each prediction, was determined by cross-validation over 8 values of K (odd numbers from 5 to 19). The value of K that minimized RMSE, K = 7, was used as the tuning parameter in the final model.
# train model; tuning parameter k, number of neighbors
set.seed(111)
knn_mod <- train(x=trainX, y=trainY$PH,
method="knn",
tuneLength=8,
trControl=ctrl)
knn_mod %>%
ggplot() +
ggtitle("Results of KNN cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
knn_pred <- predict(knn_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data
knn_acc <- postResample(pred=knn_pred, obs=testY)
A support vector machine (SVM) regression model was tested using both a linear kernel (not shown) and a radial basis function kernel; the radial kernel outperformed the linear one. Cross-validation was used to test varying values of the cost (C), a term that penalizes large residuals, and sigma, which controls the width of the radial kernel and therefore how far a training point's influence extends. The grid of values included all pairwise combinations of C = 0.01, 0.1, 1, 2 and sigma = 0.01, 0.1. The combination that minimized RMSE was sigma = 0.1 and C = 2.
# train model
set.seed(2021)
grid <- expand.grid(C=c(0.01, 0.1, 1, 2),sigma=c(0.01, 0.1))
svm_mod <- train(x=trainX, y=trainY$PH,
method="svmRadial",
tuneLength=10,
trControl=ctrl,
tuneGrid=grid)
# plot
svm_mod %>%
ggplot() +
ggtitle("Results of SVM cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# make predictions on test set data
svm_pred <- predict(svm_mod, newdata = testX) %>% as.data.frame()
# get accuracy metrics of test set data
svm_acc <- postResample(pred=svm_pred, obs=testY)
For the neural network (avNNet) model, the hyperparameter size refers to the number of units in the hidden layer, and weight decay is a regularization term that penalizes large coefficients. Varying values of size and decay were tested during model training, covering all pairwise combinations of decay = 0.001, 0.01, 0.1 and size = integers from 1 to 10. Highly correlated predictors were filtered out before fitting, since neural networks are sensitive to collinear inputs.
The selected tuning parameters were size = 9, decay = 0.01, and bag = FALSE.
# remove high correlated variables
tooHigh <- findCorrelation(cor(trainX), cutoff=.75)
trainXnnet <- trainX[,-tooHigh]
testXnnet <- testX[,-tooHigh]
# Grid of hyperparameters
nnetGrid <- expand.grid(.decay=c(0.001, 0.01,.1),
.size=c(1:10),
.bag=F)
# run model
nnet_mod <- train(x=trainXnnet, y=trainY$PH,
method="avNNet",
tuneGrid=nnetGrid,
trControl=ctrl,
linout=T,
trace=F,
MaxNWts = 10 * (ncol(training_cleaned)+1) + 10 +1,
maxit=500
)
# plot
nnet_mod %>%
ggplot() +
ggtitle("Results of nnet cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
nnet_pred <- predict(nnet_mod, newdata = testXnnet) %>% as.data.frame()
nnet_acc <- postResample(nnet_pred, obs=testY)
Multivariate adaptive regression splines (MARS) is a non-parametric algorithm that builds a piecewise-linear model to capture nonlinearities and interaction effects. A MARS model was fit by varying the hyperparameters degree (the maximum interaction level) and nprune (the maximum number of terms in the final model). The final values used for the model were nprune = 21 and degree = 3.
# create MARS grid for hyperparameters
marsGrid <- expand.grid(.degree=1:3, .nprune=4:25)
# train model
mars_mod <- train(x=trainX, y=trainY$PH,
method="earth",
tuneGrid = marsGrid,
trControl=ctrl
)
#plot
mars_mod %>%
ggplot() +
ggtitle("Results of MARS cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
mars_pred <- predict(mars_mod, newdata=testX) %>% data.frame()
mars_acc <- postResample(mars_pred, obs=testY)
The final model built was a random forest. Cross-validation was used to optimize mtry, the number of predictors considered as split candidates at each node. The final value selected for the model was mtry = 11.
set.seed(1984)
rf_mod <- train(x=trainX, y=trainY$PH,
method="rf",
tuneLength = 10,
trControl = ctrl)
#plot
rf_mod %>%
ggplot() +
ggtitle("Results of random forest cross validation") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
# predict
rf_pred <- predict(rf_mod, newdata=testX) %>% data.frame()
rf_acc <- postResample(rf_pred, obs=testY)
The six candidate models were then evaluated based on accuracy metrics computed on the test set, data unseen by the models and therefore indicative of predictive power. The accuracy metrics evaluated, RMSE, R-squared, and MAE, are provided in tabular and graphical form.
Table of Results
table <- data.frame("PLS" = pls_acc %>% round(3),
"KNN" = knn_acc %>% round(3),
"SVM" = svm_acc %>% round(3) ,
"nNet" = nnet_acc %>% round(3),
"MARS" = mars_acc %>% round(3),
"RF" = rf_acc %>% round(3)) %>%
t %>%
data.frame
table$tuning_parameter <- c("ncomp", "k", "sigma, C", "size, decay", "nprune, degree", "mtry")
table$value <- c("7", "7", "0.1, 2", "9, 0.01", "21, 3", "11")
table
Barplot of results
table$model <- rownames(table)
table %>%
select(RMSE, Rsquared, MAE,model) %>%
pivot_longer(c(RMSE,Rsquared,MAE)) %>%
arrange(value) %>%
ggplot(aes(x=model,y=value))+
geom_bar(stat="identity", col="black",fill="dodgerblue", alpha=.5)+
geom_text(aes(label=value),vjust=1.5, color="white")+
facet_wrap(~name, nrow=3, scales = "free_y") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")
)
Based on model performance on the test set, the random forest model was selected as the best predictor of pH in the bottling process. Predictions made with the tuned random forest model are provided in the accompanying Excel-readable file. Finally, variable-importance calculations for the random forest model are provided so that future decisions regarding bottling procedures can be targeted in ways that effectively manage pH levels.
# predictions
#predictions <- predict(rf_mod, testX)
# cbind(predictions,testing) %>%
# write.csv('predictions.csv')
varImp(rf_mod)%>%
plot
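For completeness, the sketch below shows one way the evaluation-set predictions delivered in the Excel-readable file could be generated, mirroring the commented-out code above: the StudentEvaluation data is passed through the same preprocessing recipe and centering/scaling, then scored with the tuned random forest. The preproc object and the output file name are assumptions rather than parts of the original code.
# Sketch: score the evaluation set with the selected random forest model
# (assumes the datatrans recipe and the preproc centering/scaling object
# from the setup sketch earlier).
eval_cleaned <- bake(datatrans, new_data = testing)
evalX <- predict(preproc, select(eval_cleaned, -PH))
testing %>%
  mutate(PH = predict(rf_mod, newdata = evalX)) %>%
  write.csv("predictions.csv", row.names = FALSE)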