New regulations require us to better understand the manufacturing process of our beverages. In this report, we demonstrate how we built a predictive model for the PH level of a can: we outline how we prepared the data, how we selected our predictive model, and finally how we used the selected model to make predictions on a new set of data.
To prepare the data for model building, we first plotted the missing-data pattern. Very little data is missing, and the gaps are spread widely across the samples. This suggests that we can impute the missing values without worrying about injecting too much new information into the data set.
# Load the raw training and evaluation data sets.
student.data <- readxl::read_excel('./data/StudentData.xlsx')
student.evaluation <- readxl::read_excel('./data/StudentEvaluation.xlsx')
# Plot the pattern of missing data with VIM::aggr.
missing.data.plot <- function(data){
  data %>%
    VIM::aggr(col=c('navyblue', 'yellow'),
              numbers=TRUE, sortVars=TRUE,
              labels=names(data), cex.axis=.6,
              gap=3, ylab=c('Missing Data', 'Pattern'), combined=TRUE)
}
missing.data.plot(student.data)
##
## Variables sorted by number of missings:
## Variable Count
## MFR 212
## Brand Code 120
## Filler Speed 57
## PC Volume 39
## PSC CO2 39
## Fill Ounces 38
## PSC 33
## Carb Pressure1 32
## Hyd Pressure4 30
## Carb Pressure 27
## Carb Temp 26
## PSC Fill 23
## Fill Pressure 22
## Filler Level 20
## Hyd Pressure2 15
## Hyd Pressure3 15
## Temperature 14
## Oxygen Filler 12
## Pressure Setpoint 12
## Hyd Pressure1 11
## Carb Volume 10
## Carb Rel 10
## Alch Rel 9
## Usage cont 5
## PH 4
## Mnf Flow 2
## Carb Flow 2
## Bowl Setpoint 2
## Density 1
## Balling 1
## Balling Lvl 1
## Pressure Vacuum 0
## Air Pressurer 0
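To quantify "very little": the worst offender, MFR, is missing 212 of the 2,571 samples, or roughly 8%. A quick sketch, using only functions already loaded above, computes the percentage missing per variable:
# Sketch: percent of values missing per variable, worst first.
student.data %>%
  summarise_all(~ mean(is.na(.)) * 100) %>%
  gather(variable, pct_missing) %>%
  arrange(desc(pct_missing)) %>%
  head()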
We discovered that four samples are missing the response variable PH. There are a number of ways to address this issue, but ultimately we decided to simply remove these samples: the goal of the model is to predict the PH level, and we were concerned about biasing the data by imputing the response. In addition, there is one extreme outlier whose PH level sits far above the rest of the distribution. Although we ultimately selected a modeling technique that is robust to outliers, we believe this value was either incorrectly recorded or the result of a massive aberration that is unlikely to be seen again, so we removed this sample as well.
student.data %>%
  ggplot(aes(PH, fill=PH > 9)) +
  geom_histogram(bins=30) +
  theme_bw() +
  theme(legend.position='none') +
  labs(y='Count',
       title='PH Levels in Training Data')
## Warning: Removed 4 rows containing non-finite values (stat_bin).
student.data <- student.data %>%
  filter(!is.na(PH),
         PH < 9)
We created a simple data-processing pipeline to fix a number of formatting issues with the data. We also one-hot encoded the only categorical variable, Brand Code. Most importantly, the pipeline imputes the missing data using a powerful method called MICE (Multivariate Imputation by Chained Equations). In short, MICE fits a separate imputation model for each variable with missing values, treating that variable as the response and the remaining variables as predictors, and it cycles through the variables for several iterations until the imputed values stabilize.
# Strip whitespace from column names and lowercase them.
clean.cols <- function(name){
  name <- gsub('\\s', '', name)
  name <- tolower(name)
  return(name)
}

# Clean the column names, impute missing values with MICE, and
# one-hot encode Brand Code.
pipeline <- function(data){
  ph <- data$PH
  data <- data %>%
    select(-PH) %>%
    mutate(`Brand Code` = as.factor(`Brand Code`)) %>%
    rename_all(~clean.cols(.)) %>%
    mice::mice(m=5, maxit=10, seed=123, printFlag=FALSE) %>%
    mice::complete(1)
  x <- predict(dummyVars("~ .", data=data), newdata=data) %>%
    as_tibble()
  return(cbind(ph, x))
}
student.data.complete <- pipeline(student.data)
student.evaluation.complete <- pipeline(student.evaluation)
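Because MICE is iterative, it is worth checking that the chained equations actually converged. A minimal sketch, assuming the same settings as the pipeline: re-run the imputation and plot the means and standard deviations of the imputed values per iteration; the trace lines should mix well and show no trend.
# Sketch: inspect MICE convergence (same settings as the pipeline).
imp <- student.data %>%
  select(-PH) %>%
  mutate(`Brand Code` = as.factor(`Brand Code`)) %>%
  rename_all(~clean.cols(.)) %>%
  mice::mice(m=5, maxit=10, seed=123, printFlag=FALSE)
plot(imp)  # trace plots of imputed-value means/SDs by iteration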
Finally, we confirm that the predictors have been fully imputed in both data sets. (The 267 missing values of ph in the evaluation set are expected: ph is the response we are asked to predict, so it was never imputed.)
missing.data.plot(student.data.complete)
##
## Variables sorted by number of missings:
## Variable Count
## ph 0
## brandcode.A 0
## brandcode.B 0
## brandcode.C 0
## brandcode.D 0
## carbvolume 0
## fillounces 0
## pcvolume 0
## carbpressure 0
## carbtemp 0
## psc 0
## pscfill 0
## pscco2 0
## mnfflow 0
## carbpressure1 0
## fillpressure 0
## hydpressure1 0
## hydpressure2 0
## hydpressure3 0
## hydpressure4 0
## fillerlevel 0
## fillerspeed 0
## temperature 0
## usagecont 0
## carbflow 0
## density 0
## mfr 0
## balling 0
## pressurevacuum 0
## oxygenfiller 0
## bowlsetpoint 0
## pressuresetpoint 0
## airpressurer 0
## alchrel 0
## carbrel 0
## ballinglvl 0
missing.data.plot(student.evaluation.complete)
##
## Variables sorted by number of missings:
## Variable Count
## ph 267
## brandcode.A 0
## brandcode.B 0
## brandcode.C 0
## brandcode.D 0
## carbvolume 0
## fillounces 0
## pcvolume 0
## carbpressure 0
## carbtemp 0
## psc 0
## pscfill 0
## pscco2 0
## mnfflow 0
## carbpressure1 0
## fillpressure 0
## hydpressure1 0
## hydpressure2 0
## hydpressure3 0
## hydpressure4 0
## fillerlevel 0
## fillerspeed 0
## temperature 0
## usagecont 0
## carbflow 0
## density 0
## mfr 0
## balling 0
## pressurevacuum 0
## oxygenfiller 0
## bowlsetpoint 0
## pressuresetpoint 0
## airpressurer 0
## alchrel 0
## carbrel 0
## ballinglvl 0
With the data properly cleaned, we performed exploratory data analysis to better understand the data set we are working with. It consists of 36 variables. One is our target variable, ph. Four variables are one-hot encoded brand indicators; for regression models, we use three of these, with brand A as the base class. Based on Cook's distance, we removed 8 observations, which improved the RMSE of our regression-based models by \(\approx .003\). Observation 2354 had the highest Cook's distance, \(\approx .1\).
multi.model <- student.data.complete %>%
  select(-2) %>%   # drop brandcode.A so it serves as the base class
  filter(!row_number() %in% c(2354, 2082, 690, 1896, 1841, 475, 2562, 2149)) %>%
  lm(data=., ph~.)

multi.model %>%
  influencePlot(id.method="identify", main="Influence Plot",
                sub="Circle size is proportional to Cook's Distance")
## StudRes Hat CookD
## 367 -4.0520330 0.008196838 0.003853476
## 493 2.5018604 0.052924672 0.009973036
## 1051 0.6497145 0.117639071 0.001608355
## 1382 2.0121671 0.085853058 0.010851139
## 1660 -0.1486110 0.164553254 0.000124334
## 2550 -3.9046434 0.012993535 0.005702387
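One way to check the \(\approx .003\) figure quoted above is to compare the in-sample RMSE of the fit with and without the eight influential observations; a sketch:
# Sketch: in-sample RMSE with vs. without the 8 influential observations.
full.model <- student.data.complete %>%
  select(-2) %>%
  lm(data=., ph~.)
sqrt(mean(residuals(full.model)^2)) - sqrt(mean(residuals(multi.model)^2))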
Within our model, 13 variables were significant at the 0.001 level: brandcode.B, mnfflow, carbpressure1, hydpressure3, temperature, usagecont, density, balling, pressurevacuum, oxygenfiller, bowlsetpoint, pressuresetpoint, and ballinglvl. An additional 6 variables were significant at the 0.05 level.
multi.model %>%
summary()
##
## Call:
## lm(formula = ph ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52781 -0.07668 0.01003 0.08698 0.41612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.117e+01 1.077e+00 10.379 < 2e-16 ***
## brandcode.B 8.871e-02 2.131e-02 4.163 3.25e-05 ***
## brandcode.C -4.207e-02 2.113e-02 -1.991 0.04662 *
## brandcode.D 5.192e-02 1.750e-02 2.967 0.00304 **
## carbvolume -1.461e-01 8.944e-02 -1.633 0.10250
## fillounces -8.211e-02 3.201e-02 -2.565 0.01037 *
## pcvolume -1.037e-01 5.387e-02 -1.924 0.05441 .
## carbpressure 3.096e-03 4.177e-03 0.741 0.45867
## carbtemp -1.668e-03 3.294e-03 -0.506 0.61257
## psc -7.863e-02 5.626e-02 -1.397 0.16240
## pscfill -3.531e-02 2.309e-02 -1.529 0.12643
## pscco2 -1.224e-01 6.248e-02 -1.959 0.05017 .
## mnfflow -7.495e-04 4.592e-05 -16.320 < 2e-16 ***
## carbpressure1 6.185e-03 6.880e-04 8.990 < 2e-16 ***
## fillpressure 1.332e-03 1.210e-03 1.102 0.27077
## hydpressure1 -4.086e-05 3.630e-04 -0.113 0.91039
## hydpressure2 -1.188e-03 5.309e-04 -2.237 0.02535 *
## hydpressure3 3.581e-03 5.830e-04 6.142 9.44e-10 ***
## hydpressure4 -8.337e-05 3.187e-04 -0.262 0.79366
## fillerlevel -8.664e-04 5.605e-04 -1.546 0.12231
## fillerspeed 1.930e-05 9.941e-06 1.941 0.05231 .
## temperature -1.612e-02 2.314e-03 -6.966 4.15e-12 ***
## usagecont -7.252e-03 1.135e-03 -6.387 2.00e-10 ***
## carbflow 8.946e-06 3.745e-06 2.389 0.01697 *
## density -1.181e-01 2.777e-02 -4.252 2.20e-05 ***
## mfr -7.817e-05 5.576e-05 -1.402 0.16105
## balling -1.464e-01 2.605e-02 -5.618 2.15e-08 ***
## pressurevacuum -3.240e-02 7.885e-03 -4.110 4.09e-05 ***
## oxygenfiller -3.800e-01 7.231e-02 -5.256 1.60e-07 ***
## bowlsetpoint 2.861e-03 5.905e-04 4.845 1.34e-06 ***
## pressuresetpoint -7.750e-03 1.962e-03 -3.950 8.02e-05 ***
## airpressurer -1.756e-03 2.343e-03 -0.749 0.45364
## alchrel 8.013e-02 2.658e-02 3.014 0.00260 **
## carbrel 4.685e-02 4.802e-02 0.976 0.32933
## ballinglvl 1.852e-01 2.628e-02 7.049 2.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1312 on 2523 degrees of freedom
## Multiple R-squared: 0.4221, Adjusted R-squared: 0.4143
## F-statistic: 54.21 on 34 and 2523 DF, p-value: < 2.2e-16
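Rather than counting significance stars by eye, the terms at each level can be pulled from the summary programmatically:
# Sketch: list coefficients significant at the 0.001 and 0.05 levels.
# (The first vector includes the intercept.)
coefs <- summary(multi.model)$coefficients
rownames(coefs)[coefs[, 'Pr(>|t|)'] < 0.001]
rownames(coefs)[coefs[, 'Pr(>|t|)'] >= 0.001 & coefs[, 'Pr(>|t|)'] < 0.05]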
We found strong correlations (both positive and negative) among a variety of features. In the plot below, stripes of dark color are visible along balling and ballinglvl.
student.data.complete %>%
  cor(use='complete.obs') %>%
  corrplot(method='color', type='upper')
Because some variables are highly correlated with each other, we checked for multicollinearity. Many variables had high variance inflation factors (VIF). To remove this concern from our regression models, we dropped variables one at a time until every remaining variable had a VIF below 10; a sketch of this pruning loop follows the plot below.
student.data.complete %>%
  select(c(1:2, 4:36)) %>%   # drop one brand dummy to avoid aliased coefficients
  lm(data=., ph~.) %>%
  vif() %>%
  tibble('variable' = names(.), 'variable_vif' = .) %>%
  filter(variable_vif > 10) %>%
  ggplot(aes(variable, variable_vif)) +
  geom_bar(stat='identity', fill='#b5c6fc') +
  theme(panel.background = element_rect(fill = '#707996'),
        text = element_text(family = 'corben', color='#249382', size=35),
        axis.text.x = element_text(angle = 30, hjust = .9)) +
  labs(x = 'Variable',
       y = 'VIF',
       title = 'Variables with the highest VIF')
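The pruning loop itself can be sketched as follows; prune.vif is a hypothetical helper written for illustration, not the exact code we ran:
# Sketch: iteratively drop the highest-VIF predictor until all VIFs are < 10.
prune.vif <- function(data, threshold = 10){
  repeat {
    fit <- lm(ph ~ ., data = data)
    v <- vif(fit)
    if (max(v) < threshold) return(fit)
    data <- data %>% select(-all_of(names(which.max(v))))
  }
}
reduced.model <- student.data.complete %>%
  select(c(1:2, 4:36)) %>%
  prune.vif()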
Below, each feature is plotted against the response variable PH.
student.data.complete %>%
  gather(key='key', value='value', -ph) %>%
  ggplot(aes(value, ph)) +
  geom_point(alpha=.9, color='#65b285') +
  facet_wrap(~key, scales='free_x', ncol=4) +
  labs(y='PH')
We began our search for a predictive model by separating out 20% of the training data to serve as our final validation set. We also split each data frame into a list of response and predictors for easier use while testing the various models.
set.seed(123)

# Split a data frame into a list holding the response and the predictors.
listify <- function(data){
  return(list('y' = data$ph, 'x' = data %>% select(-ph)))
}

part <- createDataPartition(student.data.complete$ph, p=0.8, list=FALSE)
training <- student.data.complete %>%
  filter(row_number() %in% part)
validation <- student.data.complete %>%
  filter(!row_number() %in% part)

student.data.complete <- listify(student.data.complete)
student.evaluation.complete <- listify(student.evaluation.complete)
training <- listify(training)
validation <- listify(validation)
We iterated over numerous predictive models, including a variety of regression-based and rule-based models; the code for all of these tests is available in our repo. We ultimately settled on a Random Forest model because it achieved the best performance in our cross-validated testing, in which we assessed each model by its root mean square error (RMSE). The tabs below show a selected set of the models we trained.
The XGBoost model below achieved an RMSE of \(\approx 0.102\) in our cross-validated testing and an \(R^2 \approx 0.65\).
grid <- expand.grid(nrounds = 2500,
                    max_depth = 6,
                    eta = 0.03,
                    gamma = 0,
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    subsample = 0.5)

control <- trainControl(method='cv',
                        number=10,
                        allowParallel = TRUE)

model <- train(x = training$x,
               y = training$y,
               method = 'xgbTree',
               tuneGrid = grid,
               metric = 'RMSE',
               trControl = control)
model$results
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 1 2500 6 0.03 0 1 1 0.5
## RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.1019613 0.6476099 0.07464376 0.008630701 0.05637404 0.00453064
The best XGBoost model performed nearly as well as the Random Forest model and would serve as a good alternative choice, if needed. We also tuned a support vector machine with a radial basis function kernel:
set.seed(200)
svmRTuned <- train(x = training$x,
                   y = training$y,
                   method="svmRadial",
                   preProc=c("center","scale"),
                   tuneLength = 14,
                   trControl = trainControl(method="cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 2054 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1850, 1848, 1848, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1261216 0.4716002 0.09403762
## 0.50 0.1230421 0.4937594 0.09065211
## 1.00 0.1204268 0.5136178 0.08796257
## 2.00 0.1178681 0.5333699 0.08586277
## 4.00 0.1162084 0.5456875 0.08497896
## 8.00 0.1155904 0.5517457 0.08455720
## 16.00 0.1165009 0.5493467 0.08601967
## 32.00 0.1195934 0.5343728 0.08895020
## 64.00 0.1253487 0.5068460 0.09356035
## 128.00 0.1313385 0.4824173 0.09843447
## 256.00 0.1376892 0.4567386 0.10282025
## 512.00 0.1429095 0.4361877 0.10661876
## 1024.00 0.1434758 0.4346814 0.10693952
## 2048.00 0.1434758 0.4346814 0.10693952
##
## Tuning parameter 'sigma' was held constant at a value of 0.02056593
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02056593 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0205659307622407
##
## Number of Support Vectors : 1731
##
## Objective Function Value : -3582.52
## Training error : 0.155419
Next, we fit an elastic net. (Note: lambda.1se is defined only for cv.glmnet fits, so we tune the penalty by cross-validation.)
training_set_matrix <- as.matrix(training$x)
# Drop the first (brand dummy) column to avoid perfectly collinear indicators.
elastic_net_model <- cv.glmnet(training_set_matrix[,-1], training$y,
                               family="gaussian", alpha=.65, standardize=TRUE)
elnet_predict <- predict(elastic_net_model, s=elastic_net_model$lambda.1se,
                         newx=training_set_matrix[,-1])
RMSE(elnet_predict, training$y)
## [1] 0.1386991
Finally, we trained the Random Forest model itself:
set.seed(200)
rf <- randomForest(x = training$x,
                   y = training$y,
                   importance=TRUE,
                   ntree=1000)
# Out-of-bag RMSE, averaged across the forest as trees are added.
mean(sqrt(rf$mse))
## [1] 0.1003276
mean(rf$rsq)
## [1] 0.6572274
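For a side-by-side view, the headline RMSEs of the candidates can be collected into one table (a sketch using the objects fit above; note that it mixes out-of-bag, cross-validated, and training-set error, so it is only a rough comparison):
# Sketch: gather the candidate models' reported RMSEs.
tibble('model' = c('Random Forest', 'XGBoost', 'SVM (radial)', 'Elastic Net'),
       'rmse' = c(mean(sqrt(rf$mse)),
                  min(model$results$RMSE),
                  min(svmRTuned$results$RMSE),
                  RMSE(elnet_predict, training$y))) %>%
  arrange(rmse)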
Below we can see the order of importance of each predictor in determining the response variable. The table also indicates that a number of predictors could be dropped from the model with little loss of predictive value; this may be an interesting avenue of exploration if we wanted to prioritize a model that can be easily interpreted (a sketch of the idea follows the table). As it stands, while this model is highly accurate, it is not easy to interpret.
rf$importance %>%
  as.data.frame() %>%
  rownames_to_column(var='predictor') %>%
  arrange(desc(`%IncMSE`)) %>%
  datatable()
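As a sketch of that simpler-model idea (illustrative only, not part of our final workflow), one could refit on the ten most important predictors by %IncMSE and check how much validation accuracy is sacrificed:
# Sketch: refit using only the 10 most important predictors.
top10 <- rf$importance %>%
  as.data.frame() %>%
  rownames_to_column(var='predictor') %>%
  arrange(desc(`%IncMSE`)) %>%
  head(10) %>%
  pull(predictor)
rf.small <- randomForest(x = training$x[, top10], y = training$y, ntree=1000)
postResample(predict(rf.small, newdata=validation$x[, top10]), validation$y)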
With the model selected, we will finally run the model against the withheld validation data. This will give us a sense of how well the model will perform on data that it has never seen before. Our model scores a strong RMSE of approximately 0.0975 with an \(R^2\) of 0.7.
postResample(predict(rf, newdata=validation$x), validation$y)
## RMSE Rsquared MAE
## 0.09753956 0.70032732 0.07186069
With the model trained and validated, we can make our final predictions.
Before making our final predictions, we retrain the Random Forest model one last time. The model is retrained with the same hyper-parameters, but this time on ALL of the training data, including the previously withheld validation data. This step is often considered optional when creating a predictive model: folding the validation data back in, while leaving everything else the same, can let the model eke out a bit more predictive ability from the extra samples, but the improvement is often very small (if any), so the step is frequently skipped when training is expensive. In our case, the data set is small, so we believe there is no downside to retraining the model.
final.rf <- randomForest(x = student.data.complete$x,
                         y = student.data.complete$y,
                         importance=TRUE,
                         ntree=1000)
With the final model trained, we can finally make our predictions on the provided data.
predictions <- predict(final.rf, newdata=student.evaluation.complete$x)

predictions %>%
  tibble::enframe(name = NULL) %>%
  datatable()
Finally, as requested, we can write out the file in an Excel readable format.
predictions %>%
  tibble::enframe(name = NULL) %>%
  rownames_to_column() %>%
  rename(PH = value,
         row = rowname) %>%
  write_excel_csv('./data/predictions.csv')