This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
First, we load the libraries that we will use for our project.
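The specific packages are not echoed in the report, but they can be inferred from the functions used throughout (read.xlsx from xlsx, describe from psych, plot_histogram and plot_missing from DataExplorer, knnImputation from DMwR, sample.split from caTools, train from caret, and so on). A representative loading chunk, reconstructed on that assumption, looks like this:
# packages inferred from the functions used later in this report (reconstruction)
library(xlsx)         # read.xlsx()/write.xlsx() for the Excel data files
library(psych)        # describe() summary statistics
library(DataExplorer) # plot_histogram(), plot_missing()
library(DMwR)         # knnImputation()
library(caTools)      # sample.split() for the train/test split
library(caret)        # train(), trainControl(), postResample(), varImp()
library(randomForest) # backend for caret's method = "rf"
library(neuralnet)    # neuralnet(), compute()
library(partykit)     # as.party() for plotting the rpart tree
library(xgboost)      # backend for caret's method = "xgbTree"
library(tidyverse)    # dplyr, ggplot2, and friends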
Then we load our dataset and explore some of the statistics for our variables.
bev <- read.xlsx("StudentData.xlsx", sheetName = "Subset", stringsAsFactors = FALSE)
bev_eval <- read.xlsx("StudentEvaluation.xlsx", sheetName = "Subset (2)", stringsAsFactors = FALSE)
We have 2,571 samples and 32 features that we can use to train our predictive model of PH. Here are summary statistics for our features:
describe(bev)
## Warning in describe(bev): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## vars n mean sd median trimmed mad min
## Brand.Code* 1 2451 NaN NA NA NaN NA Inf
## Carb.Volume 2 2561 5.37 0.11 5.35 5.37 0.11 5.04
## Fill.Ounces 3 2533 23.97 0.09 23.97 23.98 0.08 23.63
## PC.Volume 4 2532 0.28 0.06 0.27 0.27 0.05 0.08
## Carb.Pressure 5 2544 68.19 3.54 68.20 68.12 3.56 57.00
## Carb.Temp 6 2545 141.09 4.04 140.80 140.99 3.85 128.60
## PSC 7 2538 0.08 0.05 0.08 0.08 0.05 0.00
## PSC.Fill 8 2548 0.20 0.12 0.18 0.18 0.12 0.00
## PSC.CO2 9 2532 0.06 0.04 0.04 0.05 0.03 0.00
## Mnf.Flow 10 2569 24.57 119.48 65.20 21.07 169.02 -100.20
## Carb.Pressure1 11 2539 122.59 4.74 123.20 122.54 4.45 105.60
## Fill.Pressure 12 2549 47.92 3.18 46.40 47.71 2.37 34.60
## Hyd.Pressure1 13 2560 12.44 12.43 11.40 10.84 16.90 -0.80
## Hyd.Pressure2 14 2556 20.96 16.39 28.60 21.05 13.34 0.00
## Hyd.Pressure3 15 2556 20.46 15.98 27.60 20.51 13.94 -1.20
## Hyd.Pressure4 16 2541 96.29 13.12 96.00 95.45 11.86 52.00
## Filler.Level 17 2551 109.25 15.70 118.40 111.04 9.19 55.80
## Filler.Speed 18 2514 3687.20 770.82 3982.00 3919.99 47.44 998.00
## Temperature 19 2557 65.97 1.38 65.60 65.80 0.89 63.60
## Usage.cont 20 2566 20.99 2.98 21.79 21.25 3.19 12.08
## Carb.Flow 21 2569 2468.35 1073.70 3028.00 2601.14 326.17 26.00
## Density 22 2570 1.17 0.38 0.98 1.15 0.15 0.24
## MFR 23 2359 704.05 73.90 724.00 718.16 15.42 31.40
## Balling 24 2570 2.20 0.93 1.65 2.13 0.37 -0.17
## Pressure.Vacuum 25 2571 -5.22 0.57 -5.40 -5.25 0.59 -6.60
## PH 26 2567 8.55 0.17 8.54 8.55 0.18 7.88
## Oxygen.Filler 27 2559 0.05 0.05 0.03 0.04 0.02 0.00
## Bowl.Setpoint 28 2569 109.33 15.30 120.00 111.35 0.00 70.00
## Pressure.Setpoint 29 2559 47.62 2.04 46.00 47.60 0.00 44.00
## Air.Pressurer 30 2571 142.83 1.21 142.60 142.58 0.59 140.80
## Alch.Rel 31 2562 6.90 0.51 6.56 6.84 0.06 5.28
## Carb.Rel 32 2561 5.44 0.13 5.40 5.43 0.12 4.96
## Balling.Lvl 33 2570 2.05 0.87 1.48 1.98 0.21 0.00
## max range skew kurtosis se
## Brand.Code* -Inf -Inf NA NA NA
## Carb.Volume 5.70 0.66 0.39 -0.47 0.00
## Fill.Ounces 24.32 0.69 -0.02 0.86 0.00
## PC.Volume 0.48 0.40 0.34 0.67 0.00
## Carb.Pressure 79.40 22.40 0.18 -0.01 0.07
## Carb.Temp 154.00 25.40 0.25 0.24 0.08
## PSC 0.27 0.27 0.85 0.65 0.00
## PSC.Fill 0.62 0.62 0.93 0.77 0.00
## PSC.CO2 0.24 0.24 1.73 3.73 0.00
## Mnf.Flow 229.40 329.60 0.00 -1.87 2.36
## Carb.Pressure1 140.20 34.60 0.05 0.14 0.09
## Fill.Pressure 60.40 25.80 0.55 1.41 0.06
## Hyd.Pressure1 58.00 58.80 0.78 -0.14 0.25
## Hyd.Pressure2 59.40 59.40 -0.30 -1.56 0.32
## Hyd.Pressure3 50.00 51.20 -0.32 -1.57 0.32
## Hyd.Pressure4 142.00 90.00 0.55 0.63 0.26
## Filler.Level 161.20 105.40 -0.85 0.05 0.31
## Filler.Speed 4030.00 3032.00 -2.87 6.71 15.37
## Temperature 76.20 12.60 2.39 10.16 0.03
## Usage.cont 25.90 13.82 -0.54 -1.02 0.06
## Carb.Flow 5104.00 5078.00 -0.99 -0.58 21.18
## Density 1.92 1.68 0.53 -1.20 0.01
## MFR 868.60 837.20 -5.09 30.46 1.52
## Balling 4.01 4.18 0.59 -1.39 0.02
## Pressure.Vacuum -3.60 3.00 0.53 -0.03 0.01
## PH 9.36 1.48 -0.29 0.06 0.00
## Oxygen.Filler 0.40 0.40 2.66 11.09 0.00
## Bowl.Setpoint 140.00 70.00 -0.97 -0.06 0.30
## Pressure.Setpoint 52.00 8.00 0.20 -1.60 0.04
## Air.Pressurer 148.20 7.40 2.25 4.73 0.02
## Alch.Rel 8.62 3.34 0.88 -0.85 0.01
## Carb.Rel 6.06 1.10 0.50 -0.29 0.00
## Balling.Lvl 3.66 3.66 0.59 -1.49 0.02
Here are the distributions for our variables:
plot_histogram(bev)
We can also visualize how much data we are missing for each of the individual predictors.
apply(bev,2,function(x) sum(is.na(x)))
## Brand.Code Carb.Volume Fill.Ounces PC.Volume
## 120 10 38 39
## Carb.Pressure Carb.Temp PSC PSC.Fill
## 27 26 33 23
## PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure
## 39 2 32 22
## Hyd.Pressure1 Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4
## 11 15 15 30
## Filler.Level Filler.Speed Temperature Usage.cont
## 20 57 14 5
## Carb.Flow Density MFR Balling
## 2 1 212 1
## Pressure.Vacuum PH Oxygen.Filler Bowl.Setpoint
## 0 4 12 2
## Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 12 0 9 10
## Balling.Lvl
## 1
plot_missing(bev)
We noticed that 4 samples had missing values for PH, so we decided to remove them from our analysis since we cannot use them to predict outcomes. This left us with 2,567 observations.
#removal of samples with missing PH values
bev <- bev[!is.na(bev$PH),]
Here we addressed the missing values of Brand.Code by assigning them the placeholder value "E".
bev$Brand.Code[is.na(bev$Brand.Code)] <- "E"
bev_eval$Brand.Code[is.na(bev_eval$Brand.Code)] <- "E"
Next, we use KNN imputation, which fills in a missing value with a weighted average of the values observed in the most similar (nearest-neighbor) samples. We perform this imputation for all variables except the response variable, PH, and the Brand.Code predictor, which is not numeric.
#imputations
bev_imp <- knnImputation(bev[, !names(bev) %in% c("Brand.Code", "PH")])
bev_imp$PH <- bev$PH
bev_imp$Brand.Code <- bev$Brand.Code
bev_eval_imp <- knnImputation(bev_eval[, !names(bev_eval) %in% c("Brand.Code", "PH")])
bev_eval_imp$PH <- bev_eval$PH
bev_eval_imp$Brand.Code <- bev_eval$Brand.Code
We also split our data into a training and test set, using 80% of our data to train our models and holding out 20% to test them.
#data splitting
set.seed(101)
sample = sample.split(bev_imp$Brand.Code, SplitRatio = .8)
bev_train = subset(bev_imp, sample == TRUE)
bev_test = subset(bev_imp, sample == FALSE)
bev_train_X = subset(bev_train, select = -PH)
bev_train_y = bev_train[,'PH']
bev_test_X = subset(bev_test, select = -PH)
bev_test_y = bev_test[,'PH']
Linear regression is an attractive starting point because its representation is simple: a linear equation that combines a specific set of input values (x) to produce a predicted output (y). In this instance, we predict PH using three different linear regression techniques.
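Concretely, for predictors $x_1, \dots, x_p$ the fitted model takes the form
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$$
where the coefficients $\beta_j$ are estimated from the training data.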
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
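In equation form, a GLM relates the expected response to the linear predictor through a link function $g$:
$$g\left(E[y]\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$
With a Gaussian error distribution and the identity link, which is what caret's method = 'glm' defaults to for a numeric response, this reduces to ordinary least squares.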
control = trainControl(method = 'cv', number = 5,
verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)
set.seed(17)
GLM_bev_train = train(PH ~ ., data = bev_train, metric = 'RMSE', method = 'glm',preProcess = c('center', 'scale'), trControl = control)
GLM_reg_pred <- predict(GLM_bev_train, bev_test_X)
GLM_bev_train
## Generalized Linear Model
##
## 2053 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1337222 0.3895776 0.1032935
Using RMSE as a benchmark for linear regression modelling, our GLM model achieved a reasonable cross-validated RMSE of about 0.134. We will attempt two other linear techniques to determine whether this baseline can be improved.
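For reference, RMSE is the square root of the average squared prediction error:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}.$$
Lower values indicate predictions that are, on average, closer to the observed PH.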
glmnet is an efficient procedure for fitting the entire lasso or elastic-net regularization path for linear regression, as well as logistic, multinomial, Poisson and Cox models; more recent additions include the multiple-response Gaussian and grouped multinomial regression. The algorithm uses cyclical coordinate descent in a path-wise fashion to determine the best linear fit.
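The elastic-net objective that glmnet minimizes (for a Gaussian response) blends the ridge and lasso penalties through the mixing parameter $\alpha$ and the overall penalty strength $\lambda$:
$$\min_{\beta_0,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right].$$
caret tunes $\alpha$ and $\lambda$ for us below.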
set.seed(17)
glmnet_bev_train = train(PH ~ ., data = bev_train , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = control)
glmnet_reg_pred <- predict(glmnet_bev_train, bev_test_X)
glmnet_bev_train
## glmnet
##
## 2053 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0001485109 0.1336357 0.3900801 0.1033219
## 0.10 0.0014851094 0.1335344 0.3902833 0.1035784
## 0.10 0.0148510939 0.1353402 0.3767042 0.1062931
## 0.55 0.0001485109 0.1336330 0.3900154 0.1033408
## 0.55 0.0014851094 0.1339330 0.3865947 0.1040226
## 0.55 0.0148510939 0.1400201 0.3420083 0.1105175
## 1.00 0.0001485109 0.1336439 0.3898367 0.1033700
## 1.00 0.0014851094 0.1344404 0.3820734 0.1046130
## 1.00 0.0148510939 0.1440696 0.3123323 0.1136521
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.001485109.
The final values used for the model were alpha = 0.1 and lambda = 0.00149, which produced a cross-validated RMSE of 0.1335, only a marginal improvement of about 0.0002 over the GLM baseline. The similarity is explained by the small lambda: as lambda approaches zero the elastic-net penalty vanishes and the model is effectively an unpenalized GLM, so the two fits, and their RMSE values, are nearly identical. With the elastic net we accept a small amount of bias in exchange for a reduction in the variance of the coefficient estimates, but here very little shrinkage was needed.
Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear, and it is a popular method for soft modelling in industrial applications. We therefore expected it to be a reasonable candidate for this problem.
set.seed(17)
pls_bev_train = train(PH ~ ., data = bev_train , metric = 'RMSE', method = 'pls',preProcess = c('center', 'scale'), trControl = control)
pls_reg_pred <- predict(pls_bev_train, bev_test_X)
pls_bev_train
## Partial Least Squares
##
## 2053 samples
## 32 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1643, 1642, 1642, 1643, 1642
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1497914 0.2328887 0.1186398
## 2 0.1417783 0.3127237 0.1114688
## 3 0.1388702 0.3399291 0.1096126
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
The final value used for the model was ncomp = 3, which produced a cross-validated RMSE of 0.1389. Somewhat to our surprise, this was the weakest of the three linear models.
PLS guards against over-fitting when there are many manifest (directly measured) factors by assuming that a small number of underlying latent factors account for most of the variation in the response. The general idea is to extract these latent factors while explaining as much of the manifest-factor variation as possible. In this instance, the predictors appear to be directly informative rather than driven by a few strong latent components, which may explain why PLS underperformed the other linear methods.
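As a sketch of the idea, PLS decomposes the (centered and scaled) predictor matrix $X$ and the response $y$ into a small number of shared score vectors:
$$X = T P^{\top} + E, \qquad y = T q + f,$$
where the columns of $T$ (the latent components) are chosen to maximize their covariance with the response, $P$ and $q$ are loadings, and $E$ and $f$ are residuals. The tuning parameter ncomp is simply the number of columns of $T$ retained.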
In this section, we are going to fit a simple neural network using the neuralnet package and fit a linear model as a comparison.
bdev <- bev
data <- subset(bev_imp, select = -c(Brand.Code))
Confirming that there is no more missing data:
#describe(bev)
apply(data,2,function(x) sum(is.na(x)))
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## 0 0 0 0
## Carb.Temp PSC PSC.Fill PSC.CO2
## 0 0 0 0
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 0 0 0 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## 0 0 0 0
## Filler.Speed Temperature Usage.cont Carb.Flow
## 0 0 0 0
## Density MFR Balling Pressure.Vacuum
## 0 0 0 0
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## 0 0 0 0
## Alch.Rel Carb.Rel Balling.Lvl PH
## 0 0 0 0
There is no missing data, so we proceed by randomly splitting the data into a train and a test set, then fitting a linear regression model and evaluating it on the test set. Note that we use the glm() function instead of lm(); this becomes useful later when cross-validating the linear model.
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
lm.fit <- glm(PH~., data=train)
summary(lm.fit)
##
## Call:
## glm(formula = PH ~ ., data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.60499 -0.07357 0.01180 0.09077 0.59490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.080e+01 1.194e+00 9.045 < 2e-16 ***
## Carb.Volume -1.594e-02 8.231e-02 -0.194 0.846464
## Fill.Ounces -9.524e-02 3.860e-02 -2.467 0.013695 *
## PC.Volume -1.428e-01 6.481e-02 -2.203 0.027698 *
## Carb.Pressure -7.673e-05 3.584e-03 -0.021 0.982921
## Carb.Temp 1.278e-03 2.853e-03 0.448 0.654316
## PSC -6.555e-02 7.003e-02 -0.936 0.349338
## PSC.Fill -6.658e-02 2.793e-02 -2.383 0.017251 *
## PSC.CO2 -1.694e-01 7.637e-02 -2.218 0.026706 *
## Mnf.Flow -7.357e-04 5.493e-05 -13.394 < 2e-16 ***
## Carb.Pressure1 6.715e-03 8.356e-04 8.036 1.61e-15 ***
## Fill.Pressure 6.980e-04 1.475e-03 0.473 0.636039
## Hyd.Pressure1 -3.898e-04 4.383e-04 -0.889 0.373920
## Hyd.Pressure2 -7.078e-04 6.490e-04 -1.091 0.275522
## Hyd.Pressure3 3.280e-03 7.113e-04 4.612 4.26e-06 ***
## Hyd.Pressure4 -4.157e-04 3.605e-04 -1.153 0.248972
## Filler.Level -1.428e-03 6.652e-04 -2.146 0.032007 *
## Filler.Speed 1.003e-06 9.023e-06 0.111 0.911545
## Temperature -2.129e-02 2.602e-03 -8.183 5.01e-16 ***
## Usage.cont -7.125e-03 1.379e-03 -5.166 2.64e-07 ***
## Carb.Flow 8.754e-06 4.296e-06 2.038 0.041733 *
## Density -1.355e-01 3.344e-02 -4.052 5.27e-05 ***
## MFR -5.532e-05 7.182e-05 -0.770 0.441266
## Balling -7.895e-02 2.790e-02 -2.829 0.004712 **
## Pressure.Vacuum -1.065e-02 9.098e-03 -1.171 0.241945
## Oxygen.Filler -3.073e-01 8.571e-02 -3.585 0.000345 ***
## Bowl.Setpoint 3.279e-03 6.992e-04 4.690 2.93e-06 ***
## Pressure.Setpoint -9.487e-03 2.347e-03 -4.041 5.53e-05 ***
## Air.Pressurer -1.429e-03 2.899e-03 -0.493 0.621961
## Alch.Rel 1.020e-01 1.938e-02 5.262 1.59e-07 ***
## Carb.Rel 1.156e-01 5.489e-02 2.106 0.035352 *
## Balling.Lvl 7.087e-02 2.533e-02 2.798 0.005194 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.01897517)
##
## Null deviance: 56.29 on 1924 degrees of freedom
## Residual deviance: 35.92 on 1893 degrees of freedom
## AIC: -2135.3
##
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit,test)
# mean squared error on the held-out test set, comparing predictions to observed PH
MSE.lm <- sum((pr.lm - test$PH)^2)/nrow(test)
MSE.lm
The sample(x, size) function simply outputs a vector of the specified size of randomly selected samples from the vector x. By default the sampling is without replacement: index is essentially a random vector of indices.
Since we are dealing with a regression problem, we use the mean squared error (MSE) as a measure of how far our predictions are from the observed data.
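Explicitly,
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2,$$
which is simply the square of the RMSE used elsewhere in this report.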
Before fitting a neural network, some preparation needs to be done.
As a first step, we normalize the data before training the neural network, using the min-max method to scale every variable into the interval [0,1]. Scaling to [0,1] or [-1,1] usually tends to give better results.
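Min-max scaling maps each variable $x$ onto [0,1] via
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)},$$
which is exactly what the scale() call below does with center = mins and scale = maxs - mins.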
We therefore scale and split the data before moving on:
# column-wise minima and maxima used for min-max scaling
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scale() returns a matrix, so we coerce the result back into a data.frame:
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
#scaled
train_ <- scaled[index,]
test_ <- scaled[-index,]
As far as we know there is no fixed rule for how many layers and neurons to use, although there are several more or less accepted rules of thumb. Usually one hidden layer is enough for a vast number of applications, and the number of neurons is commonly set between the input layer size and the output layer size, often around 2/3 of the input size. In our experience, repeated testing is the best approach, since there is no guarantee that any of these rules will suit a given problem. For this dataset we use two hidden layers in a 31:5:3:1 configuration: the input layer takes the 31 numeric predictors (Brand.Code is excluded and PH is the response), the two hidden layers have 5 and 3 neurons, and the output layer has a single node since we are doing regression.
Let’s fit the net. Setting linear.output = TRUE makes neuralnet perform regression rather than classification.
n <- names(train_)
f <- as.formula(paste("PH ~", paste(n[!n %in% "PH"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)
The black lines show the connections between each layer and the weights on each connection, while the blue lines show the bias term added at each step. The bias can be thought of as the intercept of a linear model. The net is essentially a black box, so we cannot say much about the individual weights; suffice it to say that the training algorithm converged and the model is ready to be used.
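For reference, each hidden neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function (neuralnet's default act.fct is the logistic sigmoid):
$$a = \sigma\left(b + \sum_i w_i x_i\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$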
Now we can try to predict the values for the test set and calculate the MSE. The net will output a normalized prediction, so we need to scale it back in order to make a meaningful comparison (or just a simple prediction).
pr.nn <- compute(nn,test_[,1:32])
pr.nn_ <- pr.nn$net.result*(max(data$PH)-min(data$PH))+min(data$PH)
test.r <- (test_$PH)*(max(data$PH)-min(data$PH))+min(data$PH)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
print(MSE.nn)
## [1] 0.01773981
The neural network's test-set MSE is about 0.0177. To put this on the same footing as the linear models above, we convert it to an RMSE:
nn.rmse <- sqrt(MSE.nn)
print(nn.rmse)
## [1] 0.1331909
This is essentially on par with the cross-validated RMSE of roughly 0.134 achieved by the GLM and glmnet models above, so the neural network offers no clear advantage over the linear approaches here.
plot(test$PH,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
points(test$PH,pr.lm,col='blue',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=18,col=c('red','blue'))
In this next part, we consider various tree models to predict the PH of a beverage given our information about the 32 manufacturing features.
Classification and regression trees can be generated through rpart to create simple tree models. Tree-based models consist of one or more nested if-then statements on the predictors that partition the data; a simple model (for regression, typically the mean of the training observations in the partition) is then used to predict the outcome within each partition.
bevb <- train(x = bev_train_X, y = bev_train_y, method = "rpart", preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
bevb
## CART
##
## 2053 samples
## 32 predictor
##
## Pre-processing: centered (31), scaled (31), ignore (1)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.04517033 0.1440003 0.2811415 0.1131109
## 0.06148521 0.1483355 0.2369000 0.1158088
## 0.21000757 0.1613637 0.1948827 0.1278185
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04517033.
bevbPred <- predict(bevb, newdata = bev_test_X)
bevb.results <- postResample(pred = bevbPred, obs = bev_test_y)
bevb.results
## RMSE Rsquared MAE
## 0.1465564 0.3279953 0.1168731
The RMSE for this model on the test set is 0.147 with an R^2 of 0.33.
We also plot this specific tree below:
plot(as.party(bevb$finalModel))
Next, we will try out a Random Forest model to see if we can improve upon this model.
Random Forest is an ensemble model that grows many trees, each on a bootstrapped sample of the data with a random subset of predictors considered at each split, and combines their outputs: for classification the majority vote wins, while for a regression problem like ours the tree predictions are averaged. This reduces overfitting and variance relative to a single tree.
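For a regression forest of $B$ trees $T_1, \dots, T_B$, the prediction for a sample $x$ is simply the average of the individual tree predictions:
$$\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x).$$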
bev_train_X2 <- bev_train_X
bev_train_X2$Brand.Code <- as.factor(bev_train_X2$Brand.Code)
bevrf <- train(x = bev_train_X2, y = bev_train_y, method = "rf", preProcess = c('center', 'scale'))
bevrf
## Random Forest
##
## 2053 samples
## 32 predictor
##
## Pre-processing: centered (31), scaled (31), ignore (1)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1143393 0.5966905 0.08632714
## 17 0.1021765 0.6533844 0.07452249
## 32 0.1023225 0.6450190 0.07425013
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
bev_test_X2 <- bev_test_X
bev_test_X2$Brand.Code <- as.factor(bev_test_X2$Brand.Code)
bevrfPred <- predict(bevrf, newdata = bev_test_X2)
bevrf.results <- postResample(pred = bevrfPred, obs = bev_test_y)
bevrf.results
## RMSE Rsquared MAE
## 0.09605637 0.73994283 0.07029959
The RMSE for this model on the test set is 0.096 with an R^2 of 0.74. We will also consider an XGBoost model to see if we can find any further improvement.
XGBoost is another ensemble model, this time built on the gradient boosting framework, a form of boosting in which each new tree is fit to reduce the remaining error of the current ensemble via gradient descent. XGBoost only handles numeric inputs, so we recode our Brand.Code feature as numeric before tuning the model.
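Schematically, gradient boosting builds the model in stages: at stage $m$ a small tree $h_m$ is fit to the negative gradient of the loss (for squared error, the current residuals) and added to the ensemble with learning rate $\eta$ (the eta parameter tuned below):
$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x).$$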
bev_train_X3 <- bev_train_X
bev_train_X3$Brand.Code <- as.numeric(bev_train_X2$Brand.Code)
bevxgb <- train(x = bev_train_X3, y = bev_train_y, method = "xgbTree")
bevxgb
## eXtreme Gradient Boosting
##
## 2053 samples
## 32 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2053, 2053, 2053, 2053, 2053, 2053, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 0.1348444 0.3823046
## 0.3 1 0.6 0.50 100 0.1320053 0.4038251
## 0.3 1 0.6 0.50 150 0.1311875 0.4109874
## 0.3 1 0.6 0.75 50 0.1345659 0.3863117
## 0.3 1 0.6 0.75 100 0.1314800 0.4095951
## 0.3 1 0.6 0.75 150 0.1304921 0.4172032
## 0.3 1 0.6 1.00 50 0.1346729 0.3879390
## 0.3 1 0.6 1.00 100 0.1314202 0.4127691
## 0.3 1 0.6 1.00 150 0.1301949 0.4214902
## 0.3 1 0.8 0.50 50 0.1345012 0.3853522
## 0.3 1 0.8 0.50 100 0.1317437 0.4060782
## 0.3 1 0.8 0.50 150 0.1311550 0.4114072
## 0.3 1 0.8 0.75 50 0.1343933 0.3882591
## 0.3 1 0.8 0.75 100 0.1314519 0.4100160
## 0.3 1 0.8 0.75 150 0.1302779 0.4192618
## 0.3 1 0.8 1.00 50 0.1341937 0.3931406
## 0.3 1 0.8 1.00 100 0.1312519 0.4144106
## 0.3 1 0.8 1.00 150 0.1299507 0.4236685
## 0.3 2 0.6 0.50 50 0.1271272 0.4472720
## 0.3 2 0.6 0.50 100 0.1249203 0.4668120
## 0.3 2 0.6 0.50 150 0.1242268 0.4749295
## 0.3 2 0.6 0.75 50 0.1255304 0.4618236
## 0.3 2 0.6 0.75 100 0.1228114 0.4840227
## 0.3 2 0.6 0.75 150 0.1217355 0.4938568
## 0.3 2 0.6 1.00 50 0.1258326 0.4600049
## 0.3 2 0.6 1.00 100 0.1223547 0.4876801
## 0.3 2 0.6 1.00 150 0.1212365 0.4972870
## 0.3 2 0.8 0.50 50 0.1262705 0.4544991
## 0.3 2 0.8 0.50 100 0.1247231 0.4688732
## 0.3 2 0.8 0.50 150 0.1240470 0.4771093
## 0.3 2 0.8 0.75 50 0.1257802 0.4591446
## 0.3 2 0.8 0.75 100 0.1228193 0.4842349
## 0.3 2 0.8 0.75 150 0.1217467 0.4942752
## 0.3 2 0.8 1.00 50 0.1246914 0.4703939
## 0.3 2 0.8 1.00 100 0.1215464 0.4946886
## 0.3 2 0.8 1.00 150 0.1204430 0.5038795
## 0.3 3 0.6 0.50 50 0.1243324 0.4714911
## 0.3 3 0.6 0.50 100 0.1233583 0.4841541
## 0.3 3 0.6 0.50 150 0.1230778 0.4892816
## 0.3 3 0.6 0.75 50 0.1215663 0.4939680
## 0.3 3 0.6 0.75 100 0.1194024 0.5135975
## 0.3 3 0.6 0.75 150 0.1188775 0.5195427
## 0.3 3 0.6 1.00 50 0.1210272 0.4987666
## 0.3 3 0.6 1.00 100 0.1185590 0.5193150
## 0.3 3 0.6 1.00 150 0.1179502 0.5252458
## 0.3 3 0.8 0.50 50 0.1242515 0.4725333
## 0.3 3 0.8 0.50 100 0.1227746 0.4895285
## 0.3 3 0.8 0.50 150 0.1228564 0.4927192
## 0.3 3 0.8 0.75 50 0.1221868 0.4892047
## 0.3 3 0.8 0.75 100 0.1202711 0.5072904
## 0.3 3 0.8 0.75 150 0.1198294 0.5127807
## 0.3 3 0.8 1.00 50 0.1210405 0.4985188
## 0.3 3 0.8 1.00 100 0.1188145 0.5174223
## 0.3 3 0.8 1.00 150 0.1182544 0.5231257
## 0.4 1 0.6 0.50 50 0.1341094 0.3857087
## 0.4 1 0.6 0.50 100 0.1322465 0.4018726
## 0.4 1 0.6 0.50 150 0.1318573 0.4068002
## 0.4 1 0.6 0.75 50 0.1333919 0.3937111
## 0.4 1 0.6 0.75 100 0.1311540 0.4110954
## 0.4 1 0.6 0.75 150 0.1305185 0.4171463
## 0.4 1 0.6 1.00 50 0.1330359 0.3998425
## 0.4 1 0.6 1.00 100 0.1304134 0.4192708
## 0.4 1 0.6 1.00 150 0.1294001 0.4271819
## 0.4 1 0.8 0.50 50 0.1341155 0.3855635
## 0.4 1 0.8 0.50 100 0.1326574 0.3979709
## 0.4 1 0.8 0.50 150 0.1327138 0.3999915
## 0.4 1 0.8 0.75 50 0.1330398 0.3975192
## 0.4 1 0.8 0.75 100 0.1308921 0.4134917
## 0.4 1 0.8 0.75 150 0.1302316 0.4197039
## 0.4 1 0.8 1.00 50 0.1331379 0.3985828
## 0.4 1 0.8 1.00 100 0.1305303 0.4183593
## 0.4 1 0.8 1.00 150 0.1295697 0.4255322
## 0.4 2 0.6 0.50 50 0.1283154 0.4375900
## 0.4 2 0.6 0.50 100 0.1270359 0.4544894
## 0.4 2 0.6 0.50 150 0.1268460 0.4606956
## 0.4 2 0.6 0.75 50 0.1261218 0.4555841
## 0.4 2 0.6 0.75 100 0.1241155 0.4752960
## 0.4 2 0.6 0.75 150 0.1236380 0.4817992
## 0.4 2 0.6 1.00 50 0.1247331 0.4677182
## 0.4 2 0.6 1.00 100 0.1221623 0.4900726
## 0.4 2 0.6 1.00 150 0.1216626 0.4959950
## 0.4 2 0.8 0.50 50 0.1277910 0.4421537
## 0.4 2 0.8 0.50 100 0.1267644 0.4561743
## 0.4 2 0.8 0.50 150 0.1269287 0.4602925
## 0.4 2 0.8 0.75 50 0.1256522 0.4594194
## 0.4 2 0.8 0.75 100 0.1236538 0.4791164
## 0.4 2 0.8 0.75 150 0.1232684 0.4856178
## 0.4 2 0.8 1.00 50 0.1243008 0.4714418
## 0.4 2 0.8 1.00 100 0.1223927 0.4881998
## 0.4 2 0.8 1.00 150 0.1218716 0.4943462
## 0.4 3 0.6 0.50 50 0.1277691 0.4488687
## 0.4 3 0.6 0.50 100 0.1279276 0.4575580
## 0.4 3 0.6 0.50 150 0.1281195 0.4612835
## 0.4 3 0.6 0.75 50 0.1232397 0.4824109
## 0.4 3 0.6 0.75 100 0.1226484 0.4922922
## 0.4 3 0.6 0.75 150 0.1228919 0.4933984
## 0.4 3 0.6 1.00 50 0.1212984 0.4966803
## 0.4 3 0.6 1.00 100 0.1201459 0.5091282
## 0.4 3 0.6 1.00 150 0.1202818 0.5102040
## 0.4 3 0.8 0.50 50 0.1259074 0.4645068
## 0.4 3 0.8 0.50 100 0.1264617 0.4694143
## 0.4 3 0.8 0.50 150 0.1273376 0.4677091
## 0.4 3 0.8 0.75 50 0.1234238 0.4812640
## 0.4 3 0.8 0.75 100 0.1231713 0.4891812
## 0.4 3 0.8 0.75 150 0.1234571 0.4901053
## 0.4 3 0.8 1.00 50 0.1211727 0.4980852
## 0.4 3 0.8 1.00 100 0.1202789 0.5086490
## 0.4 3 0.8 1.00 150 0.1203230 0.5107155
## MAE
## 0.10483504
## 0.10183430
## 0.10083551
## 0.10465561
## 0.10169264
## 0.10044165
## 0.10493589
## 0.10194501
## 0.10066435
## 0.10447757
## 0.10153575
## 0.10069593
## 0.10452288
## 0.10158124
## 0.10021809
## 0.10462855
## 0.10177400
## 0.10045887
## 0.09741027
## 0.09522894
## 0.09446470
## 0.09630067
## 0.09338401
## 0.09230971
## 0.09644014
## 0.09305954
## 0.09192091
## 0.09646372
## 0.09464185
## 0.09395021
## 0.09622544
## 0.09319100
## 0.09216416
## 0.09570384
## 0.09243421
## 0.09126579
## 0.09433460
## 0.09332015
## 0.09301100
## 0.09221326
## 0.09013184
## 0.08951059
## 0.09172819
## 0.08920710
## 0.08857080
## 0.09420147
## 0.09279142
## 0.09275244
## 0.09275341
## 0.09084203
## 0.09029744
## 0.09152756
## 0.08931455
## 0.08861755
## 0.10392997
## 0.10178471
## 0.10117610
## 0.10342010
## 0.10110100
## 0.10042530
## 0.10345532
## 0.10081684
## 0.09977308
## 0.10385682
## 0.10199029
## 0.10176775
## 0.10314741
## 0.10071457
## 0.09978745
## 0.10344656
## 0.10094106
## 0.09980682
## 0.09786071
## 0.09647071
## 0.09617791
## 0.09616451
## 0.09413578
## 0.09353021
## 0.09514360
## 0.09275854
## 0.09220335
## 0.09735515
## 0.09631818
## 0.09645978
## 0.09573556
## 0.09381051
## 0.09321017
## 0.09503359
## 0.09292912
## 0.09225425
## 0.09659734
## 0.09655981
## 0.09668012
## 0.09366038
## 0.09283432
## 0.09286321
## 0.09167854
## 0.09074592
## 0.09062142
## 0.09547108
## 0.09554308
## 0.09626054
## 0.09329068
## 0.09279226
## 0.09295755
## 0.09169396
## 0.09060586
## 0.09047154
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
## = 1.
bev_test_X3 <- bev_test_X
bev_test_X3$Brand.Code <- as.numeric(as.factor(bev_test_X3$Brand.Code))
bevxgbPred <- predict(bevxgb, newdata = bev_test_X3)
bevxgb.results <- postResample(pred = bevxgbPred, obs = bev_test_y)
bevxgb.results
## RMSE Rsquared MAE
## 0.10898447 0.62870015 0.08165521
The RMSE for this model on the test set is 0.109 with an R^2 of 0.63.
Lastly for this section, the results of these 3 Tree models are summarized here:
xgb <- as.data.frame(as.list(bevxgb.results))
basic <- as.data.frame(as.list(bevb.results))
rf <- as.data.frame(as.list(bevrf.results))
xgb$model <- 'XGBoost'
basic$model <- 'Basic Regression Tree'
rf$model <- 'Random Forest'
tree.outcomes <- rbind(xgb, basic, rf)
tree.outcomes
## RMSE Rsquared MAE model
## 1 0.10898447 0.6287002 0.08165521 XGBoost
## 2 0.14655642 0.3279953 0.11687313 Basic Regression Tree
## 3 0.09605637 0.7399428 0.07029959 Random Forest
We considered Basic Regression Tree, Random Forest and XGBoost models, and Random Forest performed the best at predicting PH, with the smallest test-set RMSE (0.096) and the highest R^2 (0.74).
In the Random Forest model, these were the most important predictors:
varImp(bevrf)
## rf variable importance
##
## only 20 most important variables shown (out of 32)
##
## Overall
## Mnf.Flow 100.000
## Brand.Code 50.672
## Usage.cont 44.462
## Oxygen.Filler 28.576
## Bowl.Setpoint 24.134
## Alch.Rel 22.050
## Temperature 21.678
## Pressure.Vacuum 21.432
## Carb.Rel 20.932
## Balling.Lvl 19.665
## Air.Pressurer 16.480
## Carb.Pressure1 16.451
## Filler.Level 13.592
## Filler.Speed 13.325
## Balling 12.348
## Carb.Flow 11.095
## Density 9.610
## Hyd.Pressure3 8.417
## PC.Volume 7.724
## MFR 6.259
Finally, we will choose our best model for predicting PH among the chosen Linear, Non-Linear and Tree Models that we have gone over in this analysis. Their RMSE metrics are summarized here:
lin_model_perf <- getTrainPerf(glmnet_bev_train)
print(lin_model_perf)
## TrainRMSE TrainRsquared TrainMAE method
## 1 0.1335344 0.3902833 0.1035784 glmnet
print(nn.rmse)
## [1] 0.1331909
rf <- as.data.frame(as.list(bevrf.results))
print(rf)
## RMSE Rsquared MAE
## 1 0.09605637 0.7399428 0.07029959
The cross-validated RMSE for the chosen Linear Model (glmnet) was 0.134. The test-set RMSE for the chosen Non-Linear Model (the neural network) was 0.133. And lastly, the test-set RMSE for the chosen Tree Model (Random Forest) was 0.096. So, we chose the Random Forest model to predict the PH of our beverages given the predictors on hand.
In this chosen Random Forest model, the top five predictors influencing PH in the manufacturing process are Mnf.Flow, Brand.Code, Usage.cont, Oxygen.Filler, and Bowl.Setpoint. We should pay close attention to these predictors when trying to control the PH of our beverages under the newly enacted regulations.
Finally, we will predict the PH values for the evaluation dataset that has been provided using the Random Forest model that was chosen as the best model. We are including these predictions with this submission as well.
bev_eval_X2 <- bev_eval_imp
bev_eval_X2$Brand.Code <- as.factor(bev_eval_X2$Brand.Code)
finalPH <- predict(bevrf, newdata = bev_eval_X2)
finalResult <- bev_eval
finalResult$PH <- finalPH
head(finalResult)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp PSC
## 1 D 5.480000 24.03333 0.2700000 65.4 134.6 0.236
## 2 A 5.393333 23.95333 0.2266667 63.2 135.0 0.042
## 3 B 5.293333 23.92000 0.3033333 66.4 140.4 0.068
## 4 B 5.266667 23.94000 0.1860000 64.8 139.0 0.004
## 5 B 5.406667 24.20000 0.1600000 69.4 142.2 0.040
## 6 B 5.286667 24.10667 0.2120000 73.4 147.2 0.078
## PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1 0.40 0.04 -100 116.6 46.0 0
## 2 0.22 0.08 -100 118.8 46.2 0
## 3 0.10 0.02 -100 120.2 45.8 0
## 4 0.20 0.02 -100 124.8 40.0 0
## 5 0.30 0.06 -100 115.0 51.4 0
## 6 0.22 NA -100 118.6 46.4 0
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1 NA NA 96 129.4 3986
## 2 0 0 112 120.0 4012
## 3 0 0 98 119.4 4010
## 4 0 0 132 120.2 NA
## 5 0 0 94 116.0 4018
## 6 0 0 94 120.4 4010
## Temperature Usage.cont Carb.Flow Density MFR Balling Pressure.Vacuum
## 1 66.0 21.66 2950 0.88 727.6 1.398 -3.8
## 2 65.6 17.60 2916 1.50 735.8 2.942 -4.4
## 3 65.6 24.18 3056 0.90 734.8 1.448 -4.2
## 4 74.4 18.12 28 0.74 NA 1.056 -4.0
## 5 66.4 21.32 3214 0.88 752.0 1.398 -4.0
## 6 66.6 18.00 3064 0.84 732.0 1.298 -3.8
## PH Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## 1 8.561109 0.022 130 45.2 142.6 6.56
## 2 8.482686 0.030 120 46.0 147.2 7.14
## 3 8.522793 0.046 120 46.0 146.6 6.52
## 4 8.567560 NA 120 46.0 146.4 6.48
## 5 8.486979 0.082 120 50.0 145.8 6.50
## 6 8.531598 0.064 120 46.0 146.0 6.50
## Carb.Rel Balling.Lvl
## 1 5.34 1.48
## 2 5.58 3.04
## 3 5.34 1.46
## 4 5.50 1.48
## 5 5.38 1.46
## 6 5.42 1.44
#write.csv(finalResult, "predictions.csv")
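Since the deliverable asks for an Excel-readable file, the predictions can also be written directly to a workbook with the xlsx package already used for reading the data; the file name below (PH_predictions.xlsx) is just an illustrative choice:
# write the evaluation set with predicted PH to an Excel workbook (file name is illustrative)
write.xlsx(finalResult, "PH_predictions.xlsx", sheetName = "Predictions", row.names = FALSE)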