The data science team at ABC Beverage has been asked to provide an analysis of our manufacturing process, the predictive factors, and a predictive model of PH in order to comply with new regulations. This report details the steps taken in our analysis, including the assumptions made, the methodology used, the models tested, the model selected and the selection process, and the findings and conclusions reached from our analysis.
We were given a dataset that consisted of 31 numerical predictor variables detailing a wide range of production processes, 1 categorical variable Brand.Code, and our target variable, PH. Summary statistics for these variables can be seen in the two tables below.
| n | mean | sd | min | median | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|
| PH | 2567 | 8.55 | 0.17 | 7.88 | 8.54 | 9.36 | 1.48 | -0.29 | 0.06 | 0.00 |
| Carb.Volume | 2561 | 5.37 | 0.11 | 5.04 | 5.35 | 5.70 | 0.66 | 0.39 | -0.47 | 0.00 |
| Fill.Ounces | 2533 | 23.97 | 0.09 | 23.63 | 23.97 | 24.32 | 0.69 | -0.02 | 0.86 | 0.00 |
| PC.Volume | 2532 | 0.28 | 0.06 | 0.08 | 0.27 | 0.48 | 0.40 | 0.34 | 0.67 | 0.00 |
| Carb.Pressure | 2544 | 68.19 | 3.54 | 57.00 | 68.20 | 79.40 | 22.40 | 0.18 | -0.01 | 0.07 |
| Carb.Temp | 2545 | 141.09 | 4.04 | 128.60 | 140.80 | 154.00 | 25.40 | 0.25 | 0.24 | 0.08 |
| PSC | 2538 | 0.08 | 0.05 | 0.00 | 0.08 | 0.27 | 0.27 | 0.85 | 0.65 | 0.00 |
| PSC.Fill | 2548 | 0.20 | 0.12 | 0.00 | 0.18 | 0.62 | 0.62 | 0.93 | 0.77 | 0.00 |
| PSC.CO2 | 2532 | 0.06 | 0.04 | 0.00 | 0.04 | 0.24 | 0.24 | 1.73 | 3.73 | 0.00 |
| Mnf.Flow | 2569 | 24.57 | 119.48 | -100.20 | 65.20 | 229.40 | 329.60 | 0.00 | -1.87 | 2.36 |
| Carb.Pressure1 | 2539 | 122.59 | 4.74 | 105.60 | 123.20 | 140.20 | 34.60 | 0.05 | 0.14 | 0.09 |
| Fill.Pressure | 2549 | 47.92 | 3.18 | 34.60 | 46.40 | 60.40 | 25.80 | 0.55 | 1.41 | 0.06 |
| Hyd.Pressure1 | 2560 | 12.44 | 12.43 | -0.80 | 11.40 | 58.00 | 58.80 | 0.78 | -0.14 | 0.25 |
| Hyd.Pressure2 | 2556 | 20.96 | 16.39 | 0.00 | 28.60 | 59.40 | 59.40 | -0.30 | -1.56 | 0.32 |
| Hyd.Pressure3 | 2556 | 20.46 | 15.98 | -1.20 | 27.60 | 50.00 | 51.20 | -0.32 | -1.57 | 0.32 |
| Hyd.Pressure4 | 2541 | 96.29 | 13.12 | 52.00 | 96.00 | 142.00 | 90.00 | 0.55 | 0.63 | 0.26 |
| Filler.Level | 2551 | 109.25 | 15.70 | 55.80 | 118.40 | 161.20 | 105.40 | -0.85 | 0.05 | 0.31 |
| Filler.Speed | 2514 | 3687.20 | 770.82 | 998.00 | 3982.00 | 4030.00 | 3032.00 | -2.87 | 6.71 | 15.37 |
| Temperature | 2557 | 65.97 | 1.38 | 63.60 | 65.60 | 76.20 | 12.60 | 2.39 | 10.16 | 0.03 |
| Usage.cont | 2566 | 20.99 | 2.98 | 12.08 | 21.79 | 25.90 | 13.82 | -0.54 | -1.02 | 0.06 |
| Carb.Flow | 2569 | 2468.35 | 1073.70 | 26.00 | 3028.00 | 5104.00 | 5078.00 | -0.99 | -0.58 | 21.18 |
| Density | 2570 | 1.17 | 0.38 | 0.24 | 0.98 | 1.92 | 1.68 | 0.53 | -1.20 | 0.01 |
| MFR | 2359 | 704.05 | 73.90 | 31.40 | 724.00 | 868.60 | 837.20 | -5.09 | 30.46 | 1.52 |
| Balling | 2570 | 2.20 | 0.93 | -0.17 | 1.65 | 4.01 | 4.18 | 0.59 | -1.39 | 0.02 |
| Pressure.Vacuum | 2571 | -5.22 | 0.57 | -6.60 | -5.40 | -3.60 | 3.00 | 0.53 | -0.03 | 0.01 |
| Oxygen.Filler | 2559 | 0.05 | 0.05 | 0.00 | 0.03 | 0.40 | 0.40 | 2.66 | 11.09 | 0.00 |
| Bowl.Setpoint | 2569 | 109.33 | 15.30 | 70.00 | 120.00 | 140.00 | 70.00 | -0.97 | -0.06 | 0.30 |
| Pressure.Setpoint | 2559 | 47.62 | 2.04 | 44.00 | 46.00 | 52.00 | 8.00 | 0.20 | -1.60 | 0.04 |
| Air.Pressurer | 2571 | 142.83 | 1.21 | 140.80 | 142.60 | 148.20 | 7.40 | 2.25 | 4.73 | 0.02 |
| Alch.Rel | 2562 | 6.90 | 0.51 | 5.28 | 6.56 | 8.62 | 3.34 | 0.88 | -0.85 | 0.01 |
| Carb.Rel | 2561 | 5.44 | 0.13 | 4.96 | 5.40 | 6.06 | 1.10 | 0.50 | -0.29 | 0.00 |
| Balling.Lvl | 2570 | 2.05 | 0.87 | 0.00 | 1.48 | 3.66 | 3.66 | 0.59 | -1.49 | 0.02 |
| Brand.Code | |
|---|---|
| : 120 | |
| A: 293 | |
| B:1239 | |
| C: 304 | |
| D: 615 |
Our predictors have a wide range of distributions with some normal, some skewed, some bi-modal, and some with high zero inflation. Standardization and normalization were used for model building, the specifics of which will be described for each model in the “Models” section of the report below. Our target, PH, has a mostly normal distribution.
Since we only have one categorical variable, Brand.Code, which is related to brand and marketing rather than manufacturing process, we plotted it against each predictor and the target to see if there were any noticeable patterns that may be relevant to our model. As you can see in the plots below the brand code is evenly distributed among most predictors and most importantly evenly distributed in our target variable, PH, so it does not appear to have any predictive value and so was removed from our dataset for model training.
There was some missing data in our predictors most noticeably in MFR, which had 8.25% missing values as can be seen in the plot below. There doesn’t seem to be any pattern in the missingness however, so it is unlikely that it has any predictive value.
The plot below confirms that there is no apparent pattern in the missing values. Incomplete cases comprise only about 17% of the substantial dataset, and since there did not seem to be any patterns to the missingness, the decision was made to remove them completely from our dataset leaving us with the remaining 2129 complete cases.
The plots below were used to assess if there were any clear linear relationships between the predictors and the target, PH. Few, if any, relationships are immediately apparent.
A correlation plot shows some strong correlations between predictors. The findCorrelation function from the caret library recommends removing the MFR, Hyd.Pressure2, Carb.Rel, Air.Pressurer, Carb.Flow, Hyd.Pressure4, and Filler.Level variables at a 0.85 correlation cutoff. Upping the cutoff to 0.9 only removes one variable, Carb.Flow, from that list.
| x |
|---|
| MFR |
| Hyd.Pressure2 |
| Carb.Rel |
| Air.Pressurer |
| Carb.Flow |
| Hyd.Pressure4 |
| Filler.Level |
Before we removed any variables we decided to do our own analysis of the highly correlated variables to see if we came up with the same conclusions as the findCorrelation function. First we found all pairs of variables that had a 0.85 correlation or more, which was 13 pairs. Then we found the frequency of each variable that was a member of one or more of these pairs. The results can been seen in the two tables below.
| Var1 | Var2 | Correlation |
|---|---|---|
| Balling | Balling.Lvl | 0.99 |
| Filler.Level | Bowl.Setpoint | 0.98 |
| Density | Balling.Lvl | 0.96 |
| Density | Balling | 0.95 |
| Filler.Speed | MFR | 0.95 |
| Alch.Rel | Balling.Lvl | 0.94 |
| Balling | Alch.Rel | 0.94 |
| Hyd.Pressure2 | Hyd.Pressure3 | 0.92 |
| Density | Alch.Rel | 0.92 |
| Alch.Rel | Carb.Rel | 0.88 |
| Carb.Rel | Balling.Lvl | 0.87 |
| Balling | Carb.Rel | 0.85 |
| Density | Carb.Rel | 0.85 |
| vars | Freq |
|---|---|
| Alch.Rel | 4 |
| Balling | 4 |
| Balling.Lvl | 4 |
| Carb.Rel | 4 |
| Density | 4 |
| Bowl.Setpoint | 1 |
| Filler.Level | 1 |
| Filler.Speed | 1 |
| Hyd.Pressure2 | 1 |
| Hyd.Pressure3 | 1 |
| MFR | 1 |
What we found is that there were exactly 5 variables that were most frequently associated with highly correlated pairs. Each of these variables, Alch.Rel, Balling, Balling.Lvl, Carb.Rel, Density was involved in 4 pairs. None of the other variables were involved in more than one pair. By removing Balling.Lvl, we could eliminate 4 of our highly correlated pairs, by removing Density we got rid of 3 more, and with Balling 2 more. so we were left with only the following 4 pairs of highly correlated variables.
| Var1 | Var2 | Correlation |
|---|---|---|
| Filler.Level | Bowl.Setpoint | 0.98 |
| Filler.Speed | MFR | 0.95 |
| Hyd.Pressure2 | Hyd.Pressure3 | 0.92 |
| Alch.Rel | Carb.Rel | 0.88 |
Each of the variables in these pairs only appears once so we needed to get rid of one variable from each pair in order to eliminate all pairs of variables with a 0.85 correlation or more. So we decide to remove the 4 with the lowest correlation to PH without removing two from the same pair. This eliminated Filler.Speed, Alch.Rel, Hyd.Pressure2 and Filler.Level.
So in the end we still removed 7 variables but not all the same ones recommended by the findCorrelation function. Only 2 of the variables recommended by the function matched our list, Hyd.Pressure2 and Filler.Level.
Next we partitioned our dataset into training and validation subsets by randomly selecting 70% for training and leaving the remaining 30% set aside for testing.
We then tuned a full range of model types including: Linear Regression, Ridge Regression, Lasso, Random Forest, Tree Bag, CTree, Classification and Regression Tree (CART), Multivariate Adaptive Regression Splines (MARS), K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) using repeated cross-validation on all models. The RMSE, \(R^2\), and MAE statistics for each of these models are presented in the table below, ordered by the lowest RMSE to highest and thus best predictive performance to worst.
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Random Forest | 0.120 | 0.559 | 0.088 |
| Tree Bag | 0.134 | 0.455 | 0.102 |
| SVM | 0.136 | 0.444 | 0.097 |
| MARS | 0.137 | 0.413 | 0.104 |
| KNN | 0.142 | 0.369 | 0.106 |
| Lasso | 0.145 | 0.349 | 0.111 |
| Ridge Regression | 0.145 | 0.348 | 0.111 |
| Linear Regression | 0.145 | 0.348 | 0.111 |
| CTree | 0.151 | 0.292 | 0.116 |
| CART | 0.159 | 0.205 | 0.125 |
The random forest model was selected for further tuning based on the lowest RMSE and MAE statistics. Although it also had the highest \(R^2\) value that statistic should only be used to compare performance between variously tuned models of the same type, not between models of different types, so it’s relevance is not significant in this case.
For comparison we also plotted a tree diagram which gave us similar results with the top three predictors also taking the top 3 nodes in the tree.
Since we had removed 7 predictors before tuning our models we decided to try re-tuning the best performing model, the random forest model, using the full set of predictors. This resulted in a small improvement in performance on the validation set as measured by the RMSE and MAE as well as the improved \(R^2\) value as shown in the table below.
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Random Forest Model All Predictors | 0.111 | 0.609 | 0.076 |
| %IncMSE | IncNodePurity | |
|---|---|---|
| Mnf.Flow | 0.01220 | 6.7313 |
| Usage.cont | 0.00579 | 4.5364 |
| Bowl.Setpoint | 0.00538 | 2.7553 |
| Temperature | 0.00259 | 2.6779 |
| Carb.Rel | 0.00380 | 2.4227 |
| Filler.Level | 0.00314 | 2.3245 |
| Balling.Lvl | 0.00311 | 2.1875 |
| Oxygen.Filler | 0.00262 | 2.1504 |
| Alch.Rel | 0.00382 | 2.0364 |
| Carb.Pressure1 | 0.00126 | 1.8438 |
Three more linear models were tested using the top predictors from our random forest model, however, none of them resulted in any improvement in performance.
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| Linear Regression 2 | 0.145 | 0.348 | 0.111 |
| Linear Regression 4 | 0.147 | 0.329 | 0.115 |
| Linear Regression 3 | 0.147 | 0.328 | 0.115 |
One interesting finding from this experiment was that we were able to determine that the impact of Mnf.Flow, Usage.conf, Temperature, Oxygen.Filler, and Pressure.Setpoint are negative due to negative coefficients and the impact of Carb.Rel, Bowl.Setpoint, and Hyd.Pressure3 are positive due to positive coefficients. So there is a balancing act between these variables with some pulling in one direction on the pH and some in the other. Thus a change in one may necessitate a change in the others. The model coefficients can be seen in the model summary below.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5903 -0.0800 0.0121 0.0939 0.3733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4968043 0.3338057 28.45 < 0.0000000000000002 ***
## Mnf.Flow -0.0007197 0.0000614 -11.72 < 0.0000000000000002 ***
## Usage.cont -0.0064591 0.0014898 -4.34 0.0000155215769 ***
## Carb.Rel 0.1779166 0.0305671 5.82 0.0000000071794 ***
## Bowl.Setpoint 0.0011607 0.0003121 3.72 0.00021 ***
## Temperature -0.0243467 0.0034953 -6.97 0.0000000000049 ***
## Oxygen.Filler -0.4484166 0.1105615 -4.06 0.0000525618345 ***
## Pressure.Setpoint -0.0064261 0.0021279 -3.02 0.00257 **
## Hyd.Pressure3 0.0022136 0.0003730 5.93 0.0000000036612 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.139 on 1481 degrees of freedom
## Multiple R-squared: 0.322, Adjusted R-squared: 0.318
## F-statistic: 87.8 on 8 and 1481 DF, p-value: <0.0000000000000002
Mnf.Flow AnalysisAccording to the random forest and linear regression (with scaled data) models, the variable which accounts for the highest proportion of model variance is the Mnf.Flow (minimum night flow?) measure.
It is not surprising that Mnf.Flow appears in our model as the most critical, given that it is also the variable that has the highest absolute correlation with PH (-.459). Top-ranking correlations to PH are listed here by absolute value:
| correlation | |
|---|---|
| Mnf.Flow | -0.459 |
| Bowl.Setpoint | 0.362 |
| Usage.cont | -0.358 |
| Filler.Level | 0.352 |
| Fill.Pressure | -0.317 |
| Pressure.Setpoint | -0.312 |
| Hyd.Pressure3 | -0.268 |
| Carb.Flow | 0.234 |
| Hyd.Pressure2 | -0.223 |
| Pressure.Vacuum | 0.220 |
However as we investigate further into the predictive value of this variable, we note its tri-modal shape, with a huge spike at -100, accounting for over 46% of all values. There is another spike close to 0 (3% of observations), and then a third near-normal distribution centered around 140. Furthermore, when we test the correlation between PH and Mnf.Flow for only strictly positive values of Mnf.Flow, any correlation almost completely disappears (-0.05567).
The irregular distribution of the data for this variable allows us to consider that we are actually looking at four distinct manufacturing processes, and that the -100 value is actually more of a categorical placekeeper. As such, we delve into this variable further by mutating into four ordinal values separated by buckets of Mnf.FLow in the following ranges {1: [-1000, -1), 2: [-1, 1), 3: [1, 140), 4: [140, 1000]}
Here we are careful to retain the original dataset, rather than one which removed incomplete cases. We do so because a disproportionate amount of cases of type 2 (in range of -1 to 1) have at least one missing value. In fact only 1% of the observations in that range are complete.
| Var1 | Freq |
|---|---|
| 1 | 1184 |
| 2 | 79 |
| 3 | 637 |
| 4 | 669 |
With further evidence from the means of the binned Mnf.Flow data, we observe that although type 1 (negative Mnf.Flow) have much higher pH on average, type 2 (near-zero) has much lower pH, types 3 and 4 (positive above and below mean values) have little difference in mean pH. This is supported by the lack of correlation with pH with Mnf.Flow reported above, and the violinplot of pH values for each type below:
As such, the greatest predictor of beverage pH is whether Mnf.Flow has a negative value. The quantity itself is not likely to be useful as a model input as the negative value likely references an entirely different production process.
In another experiment we divided the dataset into subsets according to Brand.Code in order to assess what production processes are most relevant for each brand type. We imputed missing values by replacing them with the trimmed mean and then applied a random forest model to each of the four sets. Our aim was to determine if the variables found to be most important for the whole dataset carry through to the subsets.
Interestingly the random forest model performed most poorly on the brand with the highest frequency in our dataset as can be seen in the table below.
| RMSE | Rsquared | MAE | freq | |
|---|---|---|---|---|
| C | 0.166 | 0.166 | 0.133 | 304 |
| A | 0.173 | 0.123 | 0.140 | 293 |
| D | 0.178 | 0.053 | 0.139 | 615 |
| B | 0.211 | 0.023 | 0.168 | 1239 |
The mean pH for our dataset is 8.55, however, it is possible that pH may vary with brand profile. From this violin plot, we observe that the distribution of pH values for Brand D tends to be above mean, while that of Brand C is markedly below mean. We further investigated what factors determine the acidic signature of Brand C, with the conclusion that lower balling method levels (which promote solution alkalinity) may at least partially contribute.
We discovered that Mnf.Flow is no longer the most important variable at the brand level; rather, Temperature is, ranking in the top five for each of the four brands. By contrast Mnf.Flow only shows up in the top 5 list for two brands and in the 3rd and 5th spots. These results suggest that Mnf.Flow may not be as robust a predictor as our other models indicated.
The goodness of fit plot below shows that our predictors fall close to the fit line. See the accompanying csv file, “predicted_eval_values_PH.csv” for predictions of pH made by applying our Random Forest model to new data.
The main processes putting downward (acetic) pressure on pH are: Mnf.Flow, Usage.conf, Temperature, Oxygen.Filler, and Pressure.Setpoint when increased; Positive adjustment may be attained through increase in Carb.Rel, Bowl.Setpoint, and Hyd.Pressure3;
There is strong correlation between several of the manufacturing processes, in particular: MFR, Hyd.Pressure2, Carb.Rel, Air.Pressurer, Carb.Flow, Hyd.Pressure4, and Filler.Level;
Some of the observations have missing data in our predictors, most noticeably in MFR, which had 8.25% missing values, as well as Mnf.Flow when in the range of 0 to 1.
The metric most highly-correlated with PH, Mnf.Flow, has an irregular tri-modal distribution, with approx. 46% of values -100 or less, indicative of a distinct qualitative process of itself. Barring the negative and near-zero values, the positive values, which are approximately normal in distribution, have little correlation to PH. Mnf.Flow’s statistically significant predictive value can be wholly distilled from its transformation into a three-class categorical variable.
When the entire dataset is subsetted according to Brand.Code, a different series of critical variables emerges for each class from those of the general model. Mnf.Flow loses its force as a predictor, while Temperature and Air.Pressurer become key, ranking in the top five most important variables for each of the four brands under a random forest model.
pH varies with brand profile, especially in the case of Brand D (tending to be above-mean) and Brand C (markedly below mean). We further investigate what factors determine the acidic signature of Brand C, with the conclusion that lower balling method levels (which promote solution alkalinity) may at least partially contribute.
Since we had success in strengthening Mnf.Flow’s predictive value by transforming it into a categorical variable, we may want to investigate using the same transformation on some of the other variables with multi-modal distributions. Some of the variables with multi-modal distributions include: Alch.Rel, Balling, Balling.Lvl, Carb.Flow, Carb.Rel, Density and all three Hyd.Pressure variables. In addition, rather than transforming these variables, we may want to investigate using piecewise linear or MARS models with finer tuning in order to preserve the distributions in each bin.
Code used in analysis
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
options(scipen=999, digits = 2)
library(AppliedPredictiveModeling)
library(caret)
library(corrplot)
library(e1071)
library(earth)
library(faraway)
library(fpp2)
library(ggplot2)
library(gridExtra)
library(kableExtra)
require(knitr)
library(leaps)
library(lubridate)
library(MASS)
library(mlbench)
library(naniar)
library(pander)
library(pROC)
library(pscl)
library(psych)
library(randomForest)
library(readxl)
library(reshape)
library(reshape2)
library(rpart.plot)
library(tidyverse)
library(tseries)
library(urca)
library(ZIM)
# Table formatting functions
kab_tab <- function(df, cap){
df %>% kable(caption=cap) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = T)
}
kab_tab2 <- function(df, cap){
df %>% kable(caption=cap) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = F)
}
# # For pdf output
# kab_tab <- function(df, cap){
# kable(df, caption=cap, "latex", booktabs = T) %>%
# kable_styling(latex_options = c("striped", "HOLD_position", "scale_down"))
# }
# kab_tab2 <- function(df, cap){
# kable(df, caption=cap, "latex", booktabs = T) %>%
# kable_styling(latex_options = c("striped", "HOLD_position"))
# }
#Import Data
stEval<- read.csv("https://raw.githubusercontent.com/apag101/data624Group5/master/Project2/StudentEvaluation.csv?token=AB3M6K7RXH65YIAKWZBQ6KK6XA7R2", header = TRUE)
stDatao<- read.csv("https://raw.githubusercontent.com/apag101/data624Group5/master/Project2/StudentData.csv?token=AB3M6K4AYHCT5HBFJYHCCSC6XBACQ", header = TRUE)
# move PH to the first column of the dataframe
stData <- stDatao[,c(26,1:25,27:33)]
stEval <- stEval[,c(26,1:25,27:33)]
# https://stackoverflow.com/questions/5863097/selecting-only-numeric-columns-from-a-data-frame
num_sum <- describe(Filter(is.numeric, stData))
kab_tab(num_sum[,c(2,3,4,8,5,9,10:13)], cap="Summary statistics for numerical variables")
cat_sum <- summary(Filter(is.factor, stData))
kab_tab2(cat_sum, cap="Summary of categorical variable, Brand.Code")
stData[,-2] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = '#4575b4') +
theme(panel.background = element_blank(), legend.position="top")
stData %>%
gather(-Brand.Code, key = "var", value = "val") %>%
ggplot(aes(x = val, fill=Brand.Code)) +
geom_histogram(bins=10, alpha=1) +
facet_wrap(~ var, scales = "free") +
scale_fill_manual("Brand.Code",
values = c('#d73027','#fc8d59','#fee090',
'#e0f3f8','#91bfdb','#4575b4')) +
xlab("") +
ylab("") +
theme(panel.background = element_blank(), legend.position="top")
# Missing Data
vis_miss(stData[-2])
gg_miss_upset(stData,
nsets = 12,
nintersects = 18)
cstData <- subset(stData[-2], complete.cases(stData))
#plot checks
featurePlot(cstData[-1],cstData$PH)
#Correlation Matrix
cor.plt <- cor(cstData, use = "pairwise.complete.obs", method = "pearson")
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor.plt, method="color", col=col(200),
type="upper", order="hclust",
tl.col="black", tl.srt=45, tl.cex=0.5,
diag=FALSE
)
kab_tab2(names(cstData)[findCorrelation(cor(cstData[-1]), cutoff = .85)],
cap = "Variables recommended for removal by caret::findCorrelation function at the 0.85 cutoff")
#Displaying highly correlated pairs
cor.plt2 <- cor(cstData, use = "pairwise.complete.obs", method = "pearson")
cor.plt2[lower.tri(cor.plt2,diag=TRUE)] = NA #Prepare to drop duplicates and meaningless information
cor.plt2 <- as.data.frame(as.table(cor.plt2)) #Turn into a 3-column table
cor.plt2 <- na.omit(cor.plt2) #Get rid of the junk (NA's) we flagged above
cor.plt2 <- subset(cor.plt2, abs(cor.plt2$Freq)>0.85)
cor.plt2 <- cor.plt2[order(-abs(cor.plt2$Freq)),] #Sort by highest correlation (whether +ve or -ve)
rownames(cor.plt2) <- c()
names(cor.plt2)[3] <- "Correlation"
kab_tab2(cor.plt2, cap="Highly Correlated Variable Pairs")
vars <- c(as.character(cor.plt2$Var1), as.character(cor.plt2$Var2))
vars <- as.data.frame(table(vars))
vars <- vars[order(-vars$Freq),]
rownames(vars) <- c()
kab_tab2(vars, cap="Frequency of Variables Involved in Highly Correlated Pairs")
cstData2 <- subset(cstData, select = -c(Balling.Lvl, Density, Balling))
cor.plt3 <- cor(cstData2, use = "pairwise.complete.obs", method = "pearson")
cor.plt3[lower.tri(cor.plt3,diag=TRUE)] = NA #Prepare to drop duplicates and meaningless information
cor.plt3 <- as.data.frame(as.table(cor.plt3)) #Turn into a 3-column table
cor.plt3 <- na.omit(cor.plt3) #Get rid of the junk (NA's) we flagged above
cor.plt3 <- subset(cor.plt3, abs(cor.plt3$Freq)>0.85)
cor.plt3 <- cor.plt3[order(-abs(cor.plt3$Freq)),] #Sort by highest correlation (whether +ve or -ve)
rownames(cor.plt3) <- c()
names(cor.plt3)[3] <- "Correlation"
kab_tab2(cor.plt3, cap="Highly Correlated Variable Pairs")
temp <- subset(cstData2,
select = c(Alch.Rel, Bowl.Setpoint, Carb.Rel, Filler.Level,
Filler.Speed, Hyd.Pressure2, Hyd.Pressure3, MFR, PH))
temp <- (cor(temp, use = "pairwise.complete.obs", method = "pearson")[1:8,9])
# sort(abs(temp))
cstData3 <- subset(cstData2, select = -c(Filler.Speed, Alch.Rel, Hyd.Pressure2, Filler.Level))
#Partition Data
set.seed(123)
trainidx<-sample(nrow(cstData3),round(0.7*nrow(cstData3)),replace=F)
traindata<-cstData3[trainidx,]
testdata<-cstData3[-trainidx,]
require(caret)
set.seed(555)
trctrl<- trainControl(method="repeatedcv", number=3, repeats=2)
##Linear Regression
linreg <- caret::train(PH~., data=traindata, method="lm",
trControl=trctrl)
linPred <- predict(linreg, newdata = testdata)
m1<-data.frame(postResample(pred = linPred, obs = testdata$PH)) #0.1414880 0.3775156
##Ridge Regression
ridge <- caret::train(PH~., data=traindata, method="ridge",
trControl=trctrl)
ridgePred <- predict(ridge, newdata = testdata)
m2<-data.frame(postResample(pred = ridgePred, obs = testdata$PH)) #0.1414837 0.3775762
##Lasso Regression
lasso <- caret::train(PH~., data=traindata, method="lasso",
trControl=trctrl)
lassoPred <- predict(lasso, newdata = testdata)
m3<-data.frame(postResample(pred = lassoPred, obs = testdata$PH)) #0.1418941 0.3762947
##RandomForest (Processed)
rforest <- caret::train(PH~., data=traindata, method="cforest",
trControl=trctrl,
tuneLength =2)
forPred <- predict(rforest, newdata = testdata)
m4<-data.frame(postResample(pred = forPred, obs = testdata$PH)) #0.11550844 0.59451507
##Tree Bag
bag <- caret::train(PH~., data=traindata, method="treebag",
trControl=trctrl,
tuneLength =2)
bagPred <- predict(bag, newdata = testdata)
m5<-data.frame(postResample(pred = bagPred, obs = testdata$PH)) #0.12790436 0.50718903
##CTree
ctre <- caret::train(PH~., data=traindata, method="ctree2",
trControl=trctrl,
tuneLength =2)
ctrePred <- predict(ctre, newdata = testdata)
m6<-data.frame(postResample(pred = ctrePred, obs = testdata$PH)) #0.1519582 0.2804008
##CART
rcart<- caret::train(PH~., data=traindata, method="rpart",
trControl=trctrl,
tuneLength =2)
cartPred <- predict(rcart, newdata = testdata)
m7<-data.frame(postResample(pred = cartPred, obs = testdata$PH)) #0.1593219 0.2053639
##MARS
marsFit <- earth(PH~., data = traindata, degree=2, nprune=14)
marsPred <- predict(marsFit, newdata = testdata)
m8<-data.frame(postResample(pred = marsPred, obs = testdata$PH)) #0.1399919 0.3912855
##KNN
knnGrid <- expand.grid(k = 1:20)
knnFit <- caret::train(PH~., data = traindata,
method = "knn",
trControl = trctrl,
tuneGrid = knnGrid)
knnPred <- predict(knnFit, newdata = testdata)
m9<-data.frame(postResample(pred = knnPred, obs = testdata$PH)) #0.1278504 0.5088682
##SVM (Radial Kernel)
svmGrid <- expand.grid(C = c(1,1000))
svmFit <- caret::train(PH~., data = traindata,
#type='eps-regression',
method = 'svmRadialCost',
trControl = trctrl,
tuneGrid = svmGrid)
svmPred <- predict(svmFit, newdata = testdata)
m10<-data.frame(postResample(pred = svmPred, obs = testdata$PH)) #0.13136218 0.48751241
df<-data.frame(rbind(m1[,1],m2[,1],m3[,1],m4[,1],m5[,1],m6[,1],m7[,1],
m8[,1],m9[,1],m10[,1]))
rownames(df)<-c("Linear Regression","Ridge Regression","Lasso","Random Forest",
"Tree Bag","CTree","CART","MARS","KNN","SVM")
colnames(df)<-c("RMSE","Rsquared","MAE")
df <- df[order(df$RMSE),]
options(digits = 3)
kab_tab2(df, cap="MODELS")
#Variable Importance Ranking (Random Forest)
rfImp <- varImp(rforest, scale = FALSE)
bookTheme()
plot(rfImp, top=10, scales = list(y = list(cex = 0.8)))
options(digits = 5)
rfImp2 <- rfImp$importance[order(-rfImp$importance$Overall), , drop=FALSE]
kab_tab2(head(rfImp2, 10), cap="Variable Importance Scores")
tree <- rpart(PH~., data=traindata)
prp(tree)
#Using full dataset (applying na.roughfix to missing values)
cstData_all<-subset(stData[-2])
set.seed(123)
trainidx2<-sample(nrow(cstData_all),round(0.7*nrow(cstData_all)),replace=F)
traindata2<-cstData_all[trainidx2,]
testdata2<-cstData_all[-trainidx2,]
##Additional Random Forest tuning (TUNE HERE)
#rf.model2 <- randomForest(PH~., data=traindata2, na.action=na.roughfix)
rf.model2 <- randomForest(PH~., data=traindata2, na.action=na.roughfix, importance=TRUE)
rfPred2 <- predict(rf.model2, newdata = testdata2)
m11 <- data.frame(postResample(pred = rfPred2, obs = testdata2$PH)) #0.1097018 0.6173039
m11 <- t(m11)
row.names(m11) <- c("Random Forest Model All Predictors")
kab_tab2(m11, "Accuracy Measures for Random Forest Model using Full Set of Predictors")
#with importance=TRUE, uses approach by Breiman to calculate the variable importance reported as MeanDecreaseAccuracy
#https://stackoverflow.com/questions/37888619/difference-between-varimp-caret-and-importance-randomforest-for-random-fores
rfImp2 <- as.data.frame(importance(rf.model2, scale = FALSE))
options(digits = 5)
kab_tab2(head(rfImp2[order(-rfImp2[,2]),], 10), cap="Variable Importance Scores")
#Comparing Adj. Rsquared for OLS Linear Regression models, inputting the top-ten most important variables from Random Forest
##Linear Regression using all variables
linreg2 <- caret::train(PH~., data=traindata, method="lm", trControl=trctrl)
# summary(linreg2) #Adjusted R-squared: 0.356
linPred2 <- predict(linreg2, newdata = testdata)
m12<-data.frame(postResample(pred = linPred2, obs = testdata$PH)) #0.1414880 0.3775156
##Linear Regression using only top 10 variables
linreg3 <- caret::train(PH ~ Mnf.Flow + Usage.cont + Carb.Rel + Bowl.Setpoint +
Temperature + Pressure.Vacuum + Air.Pressurer +
Oxygen.Filler + Pressure.Setpoint + Hyd.Pressure3,
data=traindata, method="lm", trControl=trctrl)
# summary(linreg3) # Adjusted R-squared: 0.328
linPred3 <- predict(linreg3, newdata = testdata)
m13<-data.frame(postResample(pred = linPred3, obs = testdata$PH)) #0.1414880 0.3775156
##Linear Regression using only top 8 variables (Removing Pressure.Vacuum + Air.Pressurer)
linreg4 <- caret::train(PH ~ Mnf.Flow + Usage.cont + Carb.Rel + Bowl.Setpoint +
Temperature +
Oxygen.Filler + Pressure.Setpoint + Hyd.Pressure3,
data=traindata, method="lm", trControl=trctrl)
# summary(linreg4) # Adjusted R-squared: 0.329
linPred4 <- predict(linreg4, newdata = testdata)
m14<-data.frame(postResample(pred = linPred4, obs = testdata$PH)) #0.1414880 0.3775156
df<-data.frame(rbind(m12[,1],m13[,1],m14[,1]))
rownames(df)<-c("Linear Regression 2","Linear Regression 3","Linear Regression 4")
colnames(df)<-c("RMSE","Rsquared","MAE")
df <- df[order(df$RMSE),]
options(digits = 3)
kab_tab2(df, cap="MODELS")
# df
summary(linreg4) # Adjusted R-squared: 0.329
cor.ph <- t(as.data.frame(cor(as.matrix(stData['PH']),
as.matrix(stData[,c(-1,-2)]),
use = 'complete.obs')))
idx <- order(-abs(cor.ph[,1]))
cor.ph <- data.frame(correlation=cor.ph[idx,])
kab_tab2(head(cor.ph,10), cap="")
par(mfrow=c(1,2))
#Visualing the Mnf.Flow column
plot(stData$Mnf.Flow, stData$PH, main='pH/Mnf.Flow Scatterplot', xlab='Mnf.Flow values',ylab='pH')
hist(stData$Mnf.Flow, main='Histogram for Mnf.Flow', xlab='Mnf.Flow values')
stData2 <- stData
#Separating the Mnf.Flow column by thresholds
stData2$Mnf.Flow_ord <- cut(
stData2$Mnf.Flow,
breaks = c(-Inf, -1, 1, 140, Inf),
labels = c(1, 2, 3, 4),
right = FALSE
)
kab_tab2(table(stData2$Mnf.Flow_ord), cap="")
#Violin plot of Mnf.Flow by bins
g <-ggplot(stData2, aes(x=factor(stData2$Mnf.Flow_ord), y=stData2$PH))
g+geom_violin(alpha=0.5, color='grey') +
geom_jitter(alpha=0.5, size=4, aes(), position = position_jitter(width = 0.1), color='darkblue', show.legend=FALSE) +
ggtitle("PH by Mnf.Flow classification") +
coord_flip() +
xlab("Mnf.Flow") +
ylab("pH") +
theme(panel.background = element_blank(), legend.position="top")
#Subsetting data by brand
brandA <- stDatao[stDatao$Brand.Code == 'A',]
brandB <- stDatao[stDatao$Brand.Code == 'B',]
brandC <- stDatao[stDatao$Brand.Code == 'C',]
brandD <- stDatao[stDatao$Brand.Code == 'D',]
###Add Trimmed Means to NA Value
r <- colnames(cstData_all)[ apply(cstData_all, 2, anyNA)]
cstData_all[,colnames(cstData_all) %in% r]<-data.frame(sapply(cstData_all[,colnames(cstData_all) %in% r],
function(x) ifelse(is.na(x),
mean(x, na.rm = TRUE, trim = .1),
x)))
df<-data.frame()
#BrandA Training/Test Splitting
set.seed(123)
trainidxA<-sample(nrow(brandA),round(0.7*nrow(brandA)),replace=F)
traindataA<-cstData_all[trainidxA,]
testdataA<-cstData_all[-trainidxA,]
##RandomForest (on BrandA)
trctrl<- trainControl(method="repeatedcv", number=2,repeats=2)
rforestA <- caret::train(PH~., data=traindataA, method="cforest",
trControl=trctrl, tuneLength =2, na.action=na.omit)
forPredA <- predict(rforestA, newdata = testdataA)
l1<-data.frame(postResample(pred = forPredA, obs = testdataA$PH))
#BrandB Training/Test Splitting
set.seed(123)
trainidxB<-sample(nrow(brandB),round(0.7*nrow(brandB)),replace=F)
traindataB<-cstData_all[trainidxB,]
testdataB<-cstData_all[-trainidxB,]
##RandomForest (on BrandB)
trctrl<- trainControl(method="repeatedcv", number=2,repeats=2)
rforestB <- caret::train(PH~., data=traindataB, method="cforest",
trControl=trctrl, tuneLength =2, na.action=na.omit)
forPredB <- predict(rforestB, newdata = testdataB)
l2<-data.frame(postResample(pred = forPredB, obs = testdataB$PH))
#BrandC Training/Test Splitting
set.seed(123)
trainidxC<-sample(nrow(brandC),round(0.7*nrow(brandC)),replace=F)
traindataC<-cstData_all[trainidxC,]
testdataC<-cstData_all[-trainidxC,]
##RandomForest (on BrandC)
trctrl<- trainControl(method="repeatedcv", number=2,repeats=2)
rforestC <- caret::train(PH~., data=traindataC, method="cforest",
trControl=trctrl, tuneLength =2, na.action=na.omit)
forPredC <- predict(rforestC, newdata = testdataC)
l3<-data.frame(postResample(pred = forPredC, obs = testdataC$PH))
#BrandD Training/Test Splitting
set.seed(123)
trainidxD<-sample(nrow(brandD),round(0.7*nrow(brandD)),replace=F)
traindataD<-cstData_all[trainidxD,]
testdataD<-cstData_all[-trainidxD,]
##RandomForest (on BrandD)
trctrl<- trainControl(method="repeatedcv", number=2,repeats=2)
rforestD <- caret::train(PH~., data=traindataD, method="cforest",
trControl=trctrl, tuneLength =2, na.action=na.omit)
forPredD <- predict(rforestD, newdata = testdataD)
l4<-data.frame(postResample(pred = forPredD, obs = testdataD$PH))
freq <- as.data.frame(table(stDatao$Brand.Code))
rownames(freq)<-c("", "A","B","C","D")
freq <- freq[2:5,2]
df <-data.frame(rbind(l1[,1],l2[,1],l3[,1],l4[,1]))
rownames(df)<-c("A","B","C","D")
colnames(df)<-c("RMSE","Rsquared","MAE")
df <- cbind(df, freq)
df <- df[order(df$RMSE),]
kab_tab2(df, cap="BRANDS")
#pH by Brand
ggplot(stData, aes(Brand.Code, PH)) +
geom_violin(color = 'grey') +
geom_jitter(aes(color = Brand.Code), size = 0.8) +
ggtitle('pH by Brand') +
geom_hline(yintercept =8.55) +
coord_flip() +
theme(panel.background = element_blank(), legend.position="top")
par(mfrow=c(2,2)) # Doesn't work!
#Variable Importance Ranking (on Brand A)
rfImpA <- varImp(rforestA, scale = FALSE)
plot(rfImpA, top=5, scales = list(y = list(cex = 0.8)))
#Variable Importance Ranking (on Brand B)
rfImpB <- varImp(rforestB, scale = FALSE)
plot(rfImpB, top=5, scales = list(y = list(cex = 0.8)))
#Variable Importance Ranking (on Brand C)
rfImpC <- varImp(rforestC, scale = FALSE)
plot(rfImpC, top=5, scales = list(y = list(cex = 0.8)))
#Variable Importance Ranking (on Brand D)
rfImpD <- varImp(rforestD, scale = FALSE)
plot(rfImpD, top=5, scales = list(y = list(cex = 0.8)))
eval_p2 <- predict(rf.model2, newdata = stEval[-2]) # -2 to remove Brand.Code categorical variable
PHMut <- mutate(stEval, predProb = predict(rf.model2 ,stEval,type = "response"))
grpPH <- group_by(PHMut, cut(eval_p2, breaks = unique(quantile(eval_p2, (0:25)/26, na.rm=TRUE))))
#hosmer-lemeshow stat
hlDf <- summarise(grpPH, y= sum(PH), pPred=mean(predProb), count = n())
hlDf <- mutate(hlDf, se.fit=sqrt(pPred * (1-(pPred)/count)))
ggplot(hlDf,aes(x=pPred,y=y/count,ymin=y/count-2*se.fit,ymax=y/count+2*se.fit)) +
geom_point()+geom_linerange(color=grey(0.75))+geom_abline(intercept=0,slope=1) +
xlab("Predicted Probability") +
ylab("Observed Proportion") +
theme(panel.background = element_blank(), legend.position="top")
write.csv(eval_p2,"predicted_eval_values_PH.csv")
save.image(file = "Data624_Project2.RData")