Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Insert short overview of what the data contains and what we are trying to accomplish by building a model. Discuss our overall approach and what models might be appropriate.
Describe the size and the variables in the training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below.
The dataset contains categorical, continuous, and discrete features. The training data contains 32 predictor features and 2571 instances; a separate evaluation dataset of 267 instances is provided with the target removed. The target column is PH. While PH should be a continuous variable, it is worth noting that it takes only 52 distinct values over the 2,400+ training rows. Given this, possible models could include both regression and classification, or an ensemble of both.
There are two files provided:
- Training data: the PH column is the target we are trying to predict.
- Evaluation data: the PH column is empty in this dataset; our model will have to be scored by an outside group with knowledge of the actual PH values.

Note: both files are in simple CSV format.
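As a rough sketch of the loading step (the file names below are placeholders, not necessarily the actual names provided):

```r
library(readr)

# Training data: includes the PH target (placeholder file name)
df <- read_csv("StudentData.csv")

# Evaluation data: PH column is empty and will be scored externally (placeholder file name)
df_eval <- read_csv("StudentEvaluation.csv")

dim(df)       # expect 2571 rows, 33 columns
dim(df_eval)  # expect 267 rows
```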
Below is a list of the variables of interest in the data set:
- Brand Code: categorical, values: A, B, C, D
- Carb Volume
- Fill Ounces
- PC Volume
- Carb Pressure
- Carb Temp
- PSC
- PSC Fill
- PSC CO2
- Mnf Flow
- Carb Pressure1
- Fill Pressure
- Hyd Pressure1
- Hyd Pressure2
- Hyd Pressure3
- Hyd Pressure4
- Filler Level
- Filler Speed
- Temperature
- Usage cont
- Carb Flow
- Density
- MFR
- Balling
- Pressure Vacuum
- PH: the TARGET we will try to predict
- Bowl Setpoint
- Pressure Setpoint
- Air Pressurer
- Alch Rel
- Carb Rel
- Balling Lvl

We compiled summary statistics on our dataset to better understand the data before modeling.
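The summary tables below match the output of skimr's skim(); a minimal sketch, assuming the training data frame is named `df`:

```r
library(skimr)

# Per-column summaries: missing counts, completeness, quantiles, and inline histograms
skim(df)
```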
| Data summary | |
|---|---|
| Name | df |
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
The first observation is that we have quite a few missing data points across our features (coded as NA’s) that we will want to impute. Especially note that 4 rows are missing their PH value. We will need to drop these rows as they cannot be used for training.
Based on the summary statistics, it appears we have some highly skewed features, with means that are far from the medians, indicating skewed distributions. Some examples include the variables … . We also see several variables that appear to be quite imbalanced, with a large number of 0 values, e.g. Hyd Pressure1 and Hyd Pressure2. We might need to impute these.
If we treat PH as a classification problem, we need to understand any class imbalance, as imbalanced classes can bias the predicted labels.
Our class balance is:
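A minimal sketch of how the distinct PH values and their frequencies could be tabulated (assuming the training data frame is `df`):

```r
library(dplyr)

# Number of distinct PH values observed in the training data
n_distinct(df$PH, na.rm = TRUE)

# Frequency of each distinct PH value, most common first
df %>%
  filter(!is.na(PH)) %>%
  count(PH, sort = TRUE)
```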
PH is normally distributed with some possible outliers on the low and high ends. Given this distribution, a pure classification approach may be problematic, as the predictions may favor pH values in the mid-range (since we have more data points there). That said, it's still possible there are boundaries such that classification adds predictive information. Given the normal shape, a regression model, or possibly an ensemble of regression and classification, might be more appropriate.
Before continuing, let’s understand any missing data including which features are impacted and any patterns between missing values.
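The missing-value summary below matches the output of naniar's miss_var_summary(); a sketch, assuming the data frame is `df`:

```r
library(naniar)

# Missing count and percentage per variable, sorted descending
miss_var_summary(df)

# Upset plot showing which variables tend to be missing together
gg_miss_upset(df)
```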
| variable | n_miss | pct_miss |
|---|---|---|
| MFR | 212 | 8.2458187 |
| Brand Code | 120 | 4.6674446 |
| Filler Speed | 57 | 2.2170362 |
| PC Volume | 39 | 1.5169195 |
| PSC CO2 | 39 | 1.5169195 |
| Fill Ounces | 38 | 1.4780241 |
| PSC | 33 | 1.2835473 |
| Carb Pressure1 | 32 | 1.2446519 |
| Hyd Pressure4 | 30 | 1.1668611 |
| Carb Pressure | 27 | 1.0501750 |
| Carb Temp | 26 | 1.0112797 |
| PSC Fill | 23 | 0.8945935 |
| Fill Pressure | 22 | 0.8556982 |
| Filler Level | 20 | 0.7779074 |
| Hyd Pressure2 | 15 | 0.5834306 |
| Hyd Pressure3 | 15 | 0.5834306 |
| Temperature | 14 | 0.5445352 |
| Oxygen Filler | 12 | 0.4667445 |
| Pressure Setpoint | 12 | 0.4667445 |
| Hyd Pressure1 | 11 | 0.4278491 |
| Carb Volume | 10 | 0.3889537 |
| Carb Rel | 10 | 0.3889537 |
| Alch Rel | 9 | 0.3500583 |
| Usage cont | 5 | 0.1944769 |
| PH | 4 | 0.1555815 |
| Mnf Flow | 2 | 0.0777907 |
| Carb Flow | 2 | 0.0777907 |
| Bowl Setpoint | 2 | 0.0777907 |
| Density | 1 | 0.0388954 |
| Balling | 1 | 0.0388954 |
| Balling Lvl | 1 | 0.0388954 |
| Pressure Vacuum | 0 | 0.0000000 |
| Air Pressurer | 0 | 0.0000000 |
Notice that ~8.25% of the rows are missing the MFR field - we may need to drop this column, since imputation can have negative consequences as the percentage of missing values increases. The categorical column Brand Code is missing 4.67% of its values. Since we don't know whether these represent another brand or are genuinely missing, we will create a new categorical value 'Unknown' and assign the NA's to it. For the rest of the features, only a small percentage of values is missing, so we are probably safe imputing with a KNN approach.
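A sketch of this cleaning plan, using kNN() from the VIM package (the exact code we ran may differ slightly; `df` is the training data):

```r
library(dplyr)
library(VIM)

df <- df %>%
  # Treat missing Brand Code as its own category rather than imputing it
  mutate(`Brand Code` = ifelse(is.na(`Brand Code`), "Unknown", `Brand Code`)) %>%
  # Drop the rows with no PH value - they cannot be used for training
  filter(!is.na(PH))

# Impute the remaining NAs with k-nearest neighbors (default k = 5);
# imp_var = FALSE suppresses the extra *_imp indicator columns
df <- kNN(df, imp_var = FALSE)
```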
Next, we visualize the distribution profiles for each of the predictor variables. This will help us plan which variables to include, see how they might be related to each other or to PH, and identify outliers or transformations that might help improve model resolution.
The distribution profiles show the prevalence of skewness: right skew in variables such as Oxygen Filler, …, and left skew in Filler Speed, … . These deviations from a traditional normal distribution can be problematic for linear regression assumptions, so we might need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. Furthermore, we have a number of bimodal features, see Air Pressurer, Balling, and Balling Lvl. Bimodal features are both problematic and interesting, and potentially an area of opportunity and exploration. Bimodal data suggests that there are possibly two different groups or classes within the feature.
Bimodal features are particularly interesting in classification tasks, as they could indicate overlapping but separate distributions for each class, which could provide a strong predictive signal in a model.
While we don't tackle feature engineering in this analysis, a more in-depth analysis could leverage the mixtools package (see its R vignette). This package fits mixture models in which the data can be subdivided into subgroups. We could then add new binary features indicating, for each instance, which distribution it belongs to.
Here is a quick example showing a possible mix within Air Pressurer:
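The code that produced the output below is hidden; a minimal sketch of how such a two-component mixture could be fit with mixtools:

```r
library(mixtools)

# Fit a two-component Gaussian mixture to Air Pressurer
set.seed(42)
mix_fit <- normalmixEM(na.omit(df$`Air Pressurer`), k = 2)

# Estimated mixing proportions, component means, and standard deviations
mix_fit$lambda
mix_fit$mu
mix_fit$sigma
```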
## number of iterations= 7
Lastly, several features combine a continuous distribution with a high number of values at one extreme. Based on the feature meanings and the information provided, we cannot tell whether these extreme values are mistakes, data errors, or legitimate measurements. As such, we will need to review each one and determine whether to impute, leave as-is, or apply feature engineering.
In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of each variable.
The box-plots reveal outliers; however, none of them seem egregious enough to warrant imputation or removal. Outliers should only be dropped or imputed if we have reason to believe they are errant or contain no critical information.
Finally, we generate scatter plots of each variable versus the target variable to get an idea of the relationship between them.
The plots indicate some clear relationships between our target and features, such as PH & Oxygen Filler or PH & Alch Rel. However, we also see clear correlations between some of the features, for example Carb Temp & Carb Pressure. Overall, although our plots indicate some interesting relationships between our variables, they also reveal some significant issues with the data.
For instance, most of the predictor variables are skewed or non-normally distributed, and will need to be transformed. It also appears we have some missing data encoded as 0.
With our missing data imputed, we can now quantify the correlations between our target variable and each predictor. We will want to focus on features with stronger positive or negative correlations; features with correlations close to zero will probably not provide meaningful information for predicting PH.
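The ranked correlations below can be produced along these lines (a sketch; `df` is assumed to be the imputed training data):

```r
# Correlation of every numeric predictor with the PH target
num_df <- df[sapply(df, is.numeric)]
cor_ph <- cor(num_df, use = "pairwise.complete.obs")[, "PH"]

# Drop PH's correlation with itself and sort from strongest positive to strongest negative
cor_ph <- sort(cor_ph[names(cor_ph) != "PH"], decreasing = TRUE)
data.frame(values = as.numeric(cor_ph), ind = names(cor_ph))
```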
## values ind
## 1 0.361587534 Bowl Setpoint
## 2 0.352043962 Filler Level
## 3 0.233593699 Carb Flow
## 4 0.219735497 Pressure Vacuum
## 5 0.196051481 Carb Rel
## 6 0.166682228 Alch Rel
## 7 0.164485364 Oxygen Filler
## 8 0.109371168 Balling Lvl
## 9 0.098866734 PC Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb Pressure
## 13 0.072132509 Carb Volume
## 14 0.032279368 Carb Temp
## 15 -0.007997231 Air Pressurer
## 16 -0.023809796 PSC Fill
## 17 -0.040882953 Filler Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC CO2
## 22 -0.118335902 Fill Ounces
## 23 -0.118764185 Carb Pressure1
## 24 -0.171434026 Hyd Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd Pressure2
## 27 -0.268101792 Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463 Fill Pressure
## 30 -0.357611993 Usage cont
## 31 -0.459231253 Mnf Flow
It appears that Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations with PH. The remaining variables have weak correlations, which implies they have less predictive power.
One problem that can occur with multi-variable regression is correlation among the predictors themselves, called multicollinearity. A quick check is to examine the pairwise correlations between predictors.
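A sketch of that check, visualizing the correlation matrix with corrplot and flagging strongly correlated predictors with caret's findCorrelation (the 0.75 cutoff is illustrative):

```r
library(caret)
library(corrplot)

# Correlation matrix of the numeric predictors (excluding the PH target)
pred_df <- df[sapply(df, is.numeric)]
pred_df <- pred_df[, setdiff(names(pred_df), "PH")]
cor_mat <- cor(pred_df, use = "pairwise.complete.obs")

# Visualize the matrix
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.6)

# Names of predictors involved in pairwise correlations above 0.75
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
```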
We can see that some variables are highly correlated with one another; for example, Balling Lvl correlates with Carb Volume, Carb Rel, Alch Rel, Density, and Balling at between 0.75 and 1. When we start considering features for our models, we'll need to account for these correlations and avoid including pairs with strong correlations.

As a note, this dataset is challenging because many of the predictive features go hand-in-hand with other features, and multicollinearity will be a problem.
Lastly, we want to check for any features that show near zero variance. Features that are the same across most of the instances will add little predictive information.
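The output below matches caret's nearZeroVar with saveMetrics = TRUE; a minimal sketch:

```r
library(caret)

# saveMetrics = TRUE returns the frequency ratio, percent unique values,
# and zero/near-zero variance flags for every column
nzv <- nearZeroVar(df, saveMetrics = TRUE)

# Show only the features flagged as near-zero variance
nzv[nzv$nzv, ]
```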
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31.11111 9.529366 FALSE TRUE
Hyd Pressure1 shows little variance - we will drop this feature.
To summarize our data preparation and exploration, our key findings and decisions are:

- MFR has > 8% missing values - remove this feature.
- Hyd Pressure1 shows little variance - remove this feature.
- 4 rows are missing PH - these rows need to be removed.
- Missing Brand Code values are replaced with "Unknown".
- Remaining missing values are imputed with kNN() from the VIM package.
- No outliers were removed, as all values seemed reasonable.
Brand Code is a categorical variable with values A, B, C, D and Unknown. For modeling, we will convert this to a set of dummy columns.
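One way to build the dummy columns is with caret's dummyVars; a minimal sketch (our actual implementation may differ):

```r
library(caret)

# One 0/1 indicator column per Brand Code level (A, B, C, D, Unknown)
dummies  <- dummyVars(~ `Brand Code`, data = df)
brand_df <- predict(dummies, newdata = df)

# Replace the original categorical column with its dummy columns
df <- cbind(df[, setdiff(names(df), "Brand Code")], brand_df)
```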
Finally, as our earlier data exploration and histogram plots showed, some of our variables are highly skewed. To address this, we decided to center, scale, and Box-Cox transform them (using caret's preProcess) to make them more normally distributed.
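A sketch of the transformation step with caret's preProcess, assuming `df` and `df_eval` have been cleaned and dummy-encoded as above:

```r
library(caret)
library(dplyr)

# Estimate Box-Cox lambdas, centering, and scaling from the training predictors
# (the PH target is excluded and kept on its original scale)
pre_proc <- preProcess(select(df, -PH), method = c("BoxCox", "center", "scale"))
pre_proc

# Apply the identical transformations to the training and evaluation predictors
df_trans   <- predict(pre_proc, select(df, -PH)) %>% mutate(PH = df$PH)
eval_trans <- predict(pre_proc, select(df_eval, -PH))
```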
## Created from 2567 samples and 34 variables
##
## Pre-processing:
## - Box-Cox transformation (22)
## - centered (34)
## - ignored (0)
## - scaled (34)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -1.90000 -0.15000 -0.08182 1.15000 2.00000
Here are some plots to demonstrate the changes in distributions before and after the transformations:
As expected, the dummy variables, e.g. `Brand Code`A, are binary, and we still have bimodal features since we didn't apply any feature engineering to them. A few still show skew, e.g. PSC Fill and Temperature, but they are closer to normal.
With our transformations complete, we can now continue on to building our models.
Using the training data, build at least three different models. Since we have multicollinearity, we should select models that can tolerate it or perform feature selection.
Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.
With a solid understanding of our dataset at this point, and with our data cleaned, we can now start to build out candidate models. We will explore … (LR, PLS?, KNN?, SVM?, …).
An open question is whether to prioritize explainability or accuracy; if accuracy, a neural network is probably a better direction. Note that not all model types provide a varImp() method for variable importance.
First, we decided to split our cleaned dataset into a training and testing set (80% training, 20% testing). This was necessary as the provided holdout evaluation dataset doesn’t provide PH values so we cannot measure our model performance against that dataset.
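A sketch of the 80/20 split using caret's createDataPartition, stratified on the PH target (`df_trans` is the transformed training data from above):

```r
library(caret)

set.seed(123)

# Row indices for the 80% training partition, stratified on PH
train_idx <- createDataPartition(df_trans$PH, p = 0.8, list = FALSE)

df_train <- df_trans[train_idx, ]
df_test  <- df_trans[-train_idx, ]
```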
Using our training dataset, we fit a multiple linear regression model (Model 1) that included all of the features we had retained following the data cleaning process described above.
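The summary, confidence intervals, and VIF scores shown below can be produced along these lines; a simplified sketch using all retained predictors, whereas the fitted model below lists its predictors explicitly:

```r
library(car)

# Ordinary least squares regression of PH on the retained predictors
lm_fit <- lm(PH ~ ., data = df_train)

summary(lm_fit)   # coefficient estimates, t-tests, R-squared
confint(lm_fit)   # 95% confidence intervals for each coefficient
vif(lm_fit)       # variance inflation factors to gauge multicollinearity
```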
##
## Call:
## lm(formula = PH ~ `\`Brand Code\`B` + `\`Brand Code\`C` + `\`Brand Code\`D` +
## `Carb Volume` + `Fill Ounces` + `PC Volume` + `Carb Temp` +
## `PSC Fill` + `PSC CO2` + `Mnf Flow` + `Carb Pressure1` +
## `Fill Pressure` + `Hyd Pressure2` + `Hyd Pressure3` + `Filler Level` +
## Temperature + `Usage cont` + `Carb Flow` + Balling + `Pressure Vacuum` +
## `Oxygen Filler` + `Bowl Setpoint` + `Pressure Setpoint` +
## `Air Pressurer` + `Alch Rel` + `Balling Lvl`, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51348 -0.08048 0.00839 0.08877 0.45361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.544974 0.002891 2956.025 < 2e-16 ***
## `\\`Brand Code\\`B` 0.040026 0.006761 5.920 3.77e-09 ***
## `\\`Brand Code\\`C` -0.019633 0.004653 -4.219 2.56e-05 ***
## `\\`Brand Code\\`D` 0.022077 0.006761 3.266 0.001111 **
## `Carb Volume` -0.013337 0.005292 -2.520 0.011800 *
## `Fill Ounces` -0.005211 0.003058 -1.704 0.088491 .
## `PC Volume` -0.008552 0.003413 -2.505 0.012310 *
## `Carb Temp` 0.004195 0.002972 1.412 0.158232
## `PSC Fill` -0.007392 0.002965 -2.493 0.012736 *
## `PSC CO2` -0.007066 0.003046 -2.320 0.020447 *
## `Mnf Flow` -0.089220 0.006409 -13.921 < 2e-16 ***
## `Carb Pressure1` 0.031882 0.003526 9.041 < 2e-16 ***
## `Fill Pressure` 0.007959 0.004084 1.949 0.051462 .
## `Hyd Pressure2` -0.031719 0.008297 -3.823 0.000136 ***
## `Hyd Pressure3` 0.069485 0.009871 7.040 2.63e-12 ***
## `Filler Level` -0.013279 0.008600 -1.544 0.122753
## Temperature -0.019637 0.003389 -5.794 7.94e-09 ***
## `Usage cont` -0.019739 0.003742 -5.275 1.47e-07 ***
## `Carb Flow` 0.013469 0.003748 3.593 0.000334 ***
## Balling -0.074856 0.013498 -5.546 3.31e-08 ***
## `Pressure Vacuum` -0.009644 0.004204 -2.294 0.021888 *
## `Oxygen Filler` -0.017094 0.004097 -4.172 3.14e-05 ***
## `Bowl Setpoint` 0.050925 0.008912 5.714 1.27e-08 ***
## `Pressure Setpoint` -0.012537 0.004204 -2.982 0.002895 **
## `Air Pressurer` -0.004546 0.003137 -1.449 0.147391
## `Alch Rel` 0.033784 0.012701 2.660 0.007876 **
## `Balling Lvl` 0.064338 0.014774 4.355 1.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1309 on 2028 degrees of freedom
## Multiple R-squared: 0.4245, Adjusted R-squared: 0.4171
## F-statistic: 57.53 on 26 and 2028 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 8.539305e+00 8.5506426285
## `\\`Brand Code\\`B` 2.676671e-02 0.0532862240
## `\\`Brand Code\\`C` -2.875792e-02 -0.0105073515
## `\\`Brand Code\\`D` 8.818438e-03 0.0353352793
## `Carb Volume` -2.371499e-02 -0.0029592655
## `Fill Ounces` -1.120801e-02 0.0007855694
## `PC Volume` -1.524565e-02 -0.0018576704
## `Carb Temp` -1.633405e-03 0.0100241565
## `PSC Fill` -1.320598e-02 -0.0015776534
## `PSC CO2` -1.303980e-02 -0.0010926876
## `Mnf Flow` -1.017886e-01 -0.0766506964
## `Carb Pressure1` 2.496595e-02 0.0387971221
## `Fill Pressure` -5.054988e-05 0.0159693398
## `Hyd Pressure2` -4.799160e-02 -0.0154468522
## `Hyd Pressure3` 5.012758e-02 0.0888428224
## `Filler Level` -3.014510e-02 0.0035877963
## Temperature -2.628276e-02 -0.0129903691
## `Usage cont` -2.707649e-02 -0.0124007797
## `Carb Flow` 6.118198e-03 0.0208207020
## Balling -1.013267e-01 -0.0483847881
## `Pressure Vacuum` -1.788759e-02 -0.0013995870
## `Oxygen Filler` -2.512812e-02 -0.0090594646
## `Bowl Setpoint` 3.344606e-02 0.0684032641
## `Pressure Setpoint` -2.078094e-02 -0.0042925572
## `Air Pressurer` -1.069745e-02 0.0016052421
## `Alch Rel` 8.876188e-03 0.0586926219
## `Balling Lvl` 3.536413e-02 0.0933128122
## [1] "VIF scores of predictors"
## `\\`Brand Code\\`B` `\\`Brand Code\\`C` `\\`Brand Code\\`D` `Carb Volume`
## 5.480372 2.656890 5.446772 3.293094
## `Fill Ounces` `PC Volume` `Carb Temp` `PSC Fill`
## 1.119128 1.376066 1.061715 1.063613
## `PSC CO2` `Mnf Flow` `Carb Pressure1` `Fill Pressure`
## 1.059114 4.900991 1.507742 2.035490
## `Hyd Pressure2` `Hyd Pressure3` `Filler Level` Temperature
## 8.280903 11.760305 8.810006 1.382046
## `Usage cont` `Carb Flow` Balling `Pressure Vacuum`
## 1.683663 1.704095 21.463402 2.141367
## `Oxygen Filler` `Bowl Setpoint` `Pressure Setpoint` `Air Pressurer`
## 1.975616 9.311984 2.110772 1.183983
## `Alch Rel` `Balling Lvl`
## 19.159384 26.065167
Applying Model 1 against our Test Data:
INSERT discussion here
Applying Model 2 against our Test Data:
Insert discussion
Applying Model 3 against our Test Data:
Insert discussion
Applying Model 4 against our Test Data:
Insert discussion
Applying Model 5 against our Test Data:
Insert discussion
Applying Model 6 against our Test Data:
Insert discussion and summary
For the model, will you use a metric such as log-likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
Insert discussion here
We apply Model #N to the holdout evaluation set to predict the targets for these instances. We have saved these predictions as a CSV in the file eval_predictions.csv.
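A sketch of generating and saving those predictions (`final_model` and `eval_trans` are placeholder names for the chosen model and the transformed evaluation predictors):

```r
# Predict PH for the holdout evaluation instances
eval_preds <- predict(final_model, newdata = eval_trans)

# Attach the predictions to the evaluation data and write an Excel-readable CSV
eval_out    <- df_eval
eval_out$PH <- eval_preds
write.csv(eval_out, "eval_predictions.csv", row.names = FALSE)
```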
Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/eval_predictions.csv
# Copy final R code here and hide it up above