Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/

Instructions

Overview

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Deliverables

Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Introduction

The historical dataset provided by ABC Beverage contains roughly 2,500 production records, each capturing manufacturing process measurements (carbonation, fill, pressure, temperature, flow rates, and related settings) along with the resulting pH of the beverage. New regulations require us to understand the manufacturing process and the predictive factors behind pH, and to report a predictive model of PH to leadership. Our overall approach is to explore and clean the data, impute missing values, address skew and multicollinearity, and then train and compare several candidate models, spanning linear and penalized regression, nonlinear methods such as k-nearest neighbors, support vector machines and neural networks, and tree-based ensembles such as XGBoost, before selecting a final model based on held-out predictive performance.

1. Data Exploration

Describe the size and the variables in the training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below.

Dataset

The dataset contains categorical, continuous and discrete features. The training data contains 2,571 instances and 32 predictor features plus the target column, PH; a separate evaluation dataset of 267 instances is provided with the target removed. While PH should be a continuous variable, it takes only 52 distinct values across the 2,400+ training rows. Given this, possible models could include both regression and classification, or an ensemble of both.

There are two files provided:

  • StudentData.xlsx - dataset we will use to train our model. Note the PH column will be our target we are trying to predict.
  • StudentEvaluation.xlsx - holdout data used for evaluation. Note the PH column is empty in this dataset - our model will have to be scored by an outside group with knowledge of the actual PH Values.

Note: Both files are provided in Excel (.xlsx) format.

Below is a list of the variables of interest in the data set:

  • Brand Code: categorical, values: A, B, C, D
  • Carb Volume:
  • Fill Ounces:
  • PC Volume:
  • Carb Pressure:
  • Carb Temp:
  • PSC:
  • PSC Fill:
  • PSC CO2:
  • Mnf Flow:
  • Carb Pressure1:
  • Fill Pressure:
  • Hyd Pressure1:
  • Hyd Pressure2:
  • Hyd Pressure3:
  • Hyd Pressure4:
  • Filler Level:
  • Filler Speed:
  • Temperature:
  • Usage cont:
  • Carb Flow:
  • Density:
  • MFR:
  • Balling:
  • Pressure Vacuum:
  • PH: the TARGET we will try to predict.
  • Oxygen Filler:
  • Bowl Setpoint:
  • Pressure Setpoint:
  • Air Pressurer:
  • Alch Rel:
  • Carb Rel:
  • Balling Lvl:

Summary Stats

We compiled summary statistics on our dataset to better understand the data before modeling.

Data summary

Name                       df
Number of rows             2571
Number of columns          33
Column type frequency:
  character                1
  numeric                  32
Group variables            None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆

The first observation is that we have quite a few missing data points across our features (coded as NA’s) that we will want to impute. Especially note that 4 rows are missing their PH value. We will need to drop these rows as they cannot be used for training.

Based on the summary statistics, it appears we have some highly skewed features, with means that sit far from the medians. Some examples include the variables … . We also see several variables that appear to be quite imbalanced, with a large number of 0 values, e.g. Hyd Pressure1 and Hyd Pressure2. We might need to impute these zero values.

Check Target Class Bias

If we treat PH as a classification problem, we need to understand any class imbalance, as imbalance can bias the predicted classes.

Our class balance is:
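A sketch of a histogram for inspecting the target's distribution (ggplot2 is our choice here; df is the assumed data frame name):

library(ggplot2)

# Distribution of the target variable across the observed PH range
ggplot(df, aes(x = PH)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of PH", x = "PH", y = "Count")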

PH is approximately normally distributed with some possible outliers on the low and high ends. Given this distribution, a pure classification approach may be problematic, as predictions may favor pH values in the mid-range (where we have more data points). That said, it's still possible there are boundaries such that classification adds predictive information. Given the near-normal shape, a regression approach, or possibly an ensemble combining regression and classification, might be more appropriate.

Missing Data

Before continuing, let's examine the missing data, including which features are affected and whether there are patterns among the missing values.
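The per-feature counts below can be produced with the naniar package; a minimal sketch, assuming the data frame is named df:

library(naniar)

# Count and percentage of missing values per feature, sorted descending
miss_var_summary(df)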

Missing Values
variable n_miss pct_miss
MFR 212 8.2458187
Brand Code 120 4.6674446
Filler Speed 57 2.2170362
PC Volume 39 1.5169195
PSC CO2 39 1.5169195
Fill Ounces 38 1.4780241
PSC 33 1.2835473
Carb Pressure1 32 1.2446519
Hyd Pressure4 30 1.1668611
Carb Pressure 27 1.0501750
Carb Temp 26 1.0112797
PSC Fill 23 0.8945935
Fill Pressure 22 0.8556982
Filler Level 20 0.7779074
Hyd Pressure2 15 0.5834306
Hyd Pressure3 15 0.5834306
Temperature 14 0.5445352
Oxygen Filler 12 0.4667445
Pressure Setpoint 12 0.4667445
Hyd Pressure1 11 0.4278491
Carb Volume 10 0.3889537
Carb Rel 10 0.3889537
Alch Rel 9 0.3500583
Usage cont 5 0.1944769
PH 4 0.1555815
Mnf Flow 2 0.0777907
Carb Flow 2 0.0777907
Bowl Setpoint 2 0.0777907
Density 1 0.0388954
Balling 1 0.0388954
Balling Lvl 1 0.0388954
Pressure Vacuum 0 0.0000000
Air Pressurer 0 0.0000000

Notice that ~8.25% of the rows are missing the MFR field - we may need to drop this column, since imputation becomes riskier as the percentage of missing values increases. The categorical column Brand Code is missing 4.67% of its values. Since we don't know whether this represents another brand or truly missing data, we will create a new categorical value 'Unknown' and assign the NA's to it. The remaining features are each missing only a small percentage of values, so we are probably safe imputing them using a KNN approach.

Distributions

Next, we visualize the distribution profiles for each of the predictor variables. This will help us plan which variables to include, see how they might be related to each other or to PH, and identify outliers or transformations that might improve model resolution.
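A sketch of the faceted histograms used for this step (tidyverse packages; df is the assumed data frame name):

library(dplyr)
library(tidyr)
library(ggplot2)

# Faceted histograms of every numeric predictor
df %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free")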

The distribution profiles show widespread skewness: right skew in variables such as Oxygen Filler, …, and left skew in Filler Speed, … . These deviations from a normal distribution can violate linear regression assumptions, so we may need to transform the data. Several features are discrete with a limited set of possible values, e.g. Pressure Setpoint. Furthermore, we have a number of bimodal features, see Air Pressurer, Balling, and Balling Lvl. Bimodal features are both problematic and interesting, and potentially an area of opportunity and exploration. Bimodal data suggests that there may be two different groups or classes within the feature.

Bimodal features are extremely interesting in classification tasks, as they could indicate overlapping but separate distributions for each class, which could provide powerful predictive power in a model.

While we don't tackle feature engineering in this analysis, a more in-depth analysis could leverage the mixtools package (see its R vignette), which fits mixture models in which the data can be subdivided into subgroups. We could then add new binary features indicating, for each instance, which distribution it belongs to.

Here is a quick example showing a possible mix within Air Pressurer:
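A minimal sketch of such a fit using mixtools::normalmixEM (the data frame name df and the choice of two components are our assumptions):

library(mixtools)

# Fit a two-component Gaussian mixture to Air Pressurer
set.seed(424)
air_mix <- normalmixEM(df$`Air Pressurer`, k = 2)

# Estimated means, standard deviations and mixing proportions, plus a density overlay
summary(air_mix)
plot(air_mix, which = 2)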

## number of iterations= 7

Lastly, several features have otherwise smooth distributions but also a high concentration of values at one extreme. Based on the feature meanings and the information provided, we cannot tell whether these extreme values are mistakes, data errors, or legitimate readings. As such, we will need to review each one and determine whether to impute, leave as-is, or apply feature engineering.

Boxplots

In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of each variable.

The box-plots reveal outliers; however, none of them seem egregious enough to warrant imputation or removal. Outliers should only be dropped or imputed if we have reason to believe they are errant or contain no useful information.

Variable Plots

Finally, we generate scatter plots of each variable versus the target variable to get an idea of the relationship between them.

The plots indicate some clear relationships between our target and certain features, such as PH & Oxygen Filler and PH & Alch Rel. However, we also see clear correlations between some of the features themselves, for example Carb Temp & Carb Pressure. Overall, although our plots indicate some interesting relationships between our variables, they also reveal some significant issues with the data.

For instance, most of the predictor variables are skewed or non-normally distributed, and will need to be transformed. It also appears we have some missing data encoded as 0.

Feature-Target Correlations

With missing data and outliers reviewed, we can now quantify the correlations between the target variable and each predictor. We will want to favor predictors with stronger positive or negative correlations; features with correlations close to zero will probably not provide meaningful information for explaining PH.
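The ranked list below matches the output of cor() combined with stack(); a sketch, assuming missing values in df have already been handled:

# Correlation of each numeric predictor with the target, sorted from most
# positive to most negative (stack() produces the values/ind columns shown below)
num_df <- df[, sapply(df, is.numeric)]
cors   <- cor(num_df, use = "pairwise.complete.obs")[, "PH"]
stack(sort(cors[names(cors) != "PH"], decreasing = TRUE))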

##          values               ind
## 1   0.361587534     Bowl Setpoint
## 2   0.352043962      Filler Level
## 3   0.233593699         Carb Flow
## 4   0.219735497   Pressure Vacuum
## 5   0.196051481          Carb Rel
## 6   0.166682228          Alch Rel
## 7   0.164485364     Oxygen Filler
## 8   0.109371168       Balling Lvl
## 9   0.098866734         PC Volume
## 10  0.095546936           Density
## 11  0.076700227           Balling
## 12  0.076213407     Carb Pressure
## 13  0.072132509       Carb Volume
## 14  0.032279368         Carb Temp
## 15 -0.007997231     Air Pressurer
## 16 -0.023809796          PSC Fill
## 17 -0.040882953      Filler Speed
## 18 -0.045196477               MFR
## 19 -0.047066423     Hyd Pressure1
## 20 -0.069873041               PSC
## 21 -0.085259857           PSC CO2
## 22 -0.118335902       Fill Ounces
## 23 -0.118764185    Carb Pressure1
## 24 -0.171434026     Hyd Pressure4
## 25 -0.182659650       Temperature
## 26 -0.222660048     Hyd Pressure2
## 27 -0.268101792     Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463     Fill Pressure
## 30 -0.357611993        Usage cont
## 31 -0.459231253          Mnf Flow

It appears that Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations. The remaining variables have weak correlations with PH, which implies they carry less predictive power.

Multicollinearity

One problem that can occur in multi-variable regression is correlation between the predictor variables, called multicollinearity. A quick check is to compute pairwise correlations between the variables.
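A sketch of this check using cor() with the corrplot package (the visualization package is our choice; df assumed):

library(corrplot)

# Pairwise correlation matrix across the numeric predictors, visualized as a heatmap
cor_mat <- cor(df[, sapply(df, is.numeric)], use = "pairwise.complete.obs")
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.6)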

We can see that some variables are highly correlated with one another; for example, Balling Lvl is highly correlated with Carb Volume, Carb Rel, Alch Rel, Density, and Balling, with correlations between 0.75 and 1. When we start selecting features for our models, we'll need to account for these correlations and avoid including pairs with strong correlations.

As a note, this dataset is challenging because many of the predictive features go hand-in-hand with other features, so multicollinearity will be a problem.

Near Zero Variance

Lastly, we want to check for any features that show near zero variance. Features that are the same across most of the instances will add little predictive information.
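A sketch of this check with caret::nearZeroVar (df assumed to be the cleaned training data):

library(caret)

# saveMetrics = TRUE returns freqRatio / percentUnique / zeroVar / nzv for every feature
nzv_metrics <- nearZeroVar(df, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]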

##               freqRatio percentUnique zeroVar  nzv
## Hyd Pressure1  31.11111      9.529366   FALSE TRUE

Hyd Pressure1 shows little variance - we will drop this feature.

2. Data Preparation

To summarize our data exploration and preparation, we group our findings into the categories below:

Removed Fields

  • MFR has > 8% missing values - remove this feature.
  • Hyd Pressure1 shows little variance - remove this feature.

Missing Values

  • We had 4 rows with missing PH that need to be removed.
  • Replace missing Brand Code with “Unknown”
  • Impute remaining missing values using kNN() from the VIM package (see the sketch below)
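A minimal sketch of these missing-value steps (df is the assumed data frame name; k = 5 is an illustrative choice):

library(VIM)

# Drop rows with a missing target and recode missing Brand Code values
df <- df[!is.na(df$PH), ]
df$`Brand Code`[is.na(df$`Brand Code`)] <- "Unknown"

# kNN-impute the remaining missing values; imp_var = FALSE suppresses the
# extra *_imp indicator columns kNN() would otherwise append
df <- kNN(df, k = 5, imp_var = FALSE)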

Outliers

No outliers were removed as all values seemed reasonable.

Convert Categorical to Dummy

Brand Code is a categorical variable with values A, B, C, D and Unknown. For modeling, we will convert this to a set of dummy columns.
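One way to build the dummies is caret::dummyVars; a sketch (our choice of approach; fullRank = TRUE drops one reference level):

library(caret)

# Expand Brand Code into indicator columns (one level is absorbed as the reference)
dummies <- dummyVars(~ `Brand Code`, data = df, fullRank = TRUE)
brand_dummies <- predict(dummies, newdata = df)

# Bind the dummy columns back on and drop the original categorical column
df <- cbind(df[, setdiff(names(df), "Brand Code")], brand_dummies)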


Transform non-normal variables

Finally, as noted in our data exploration and histogram plots, some of our variables are highly skewed. To address this, we decided to center, scale, and Box-Cox transform the data (using caret's preProcess) to make the distributions more nearly normal.
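A sketch of this transformation with caret::preProcess (df holds the cleaned training file and df_eval the evaluation file; both names are assumptions):

library(caret)

# Estimate Box-Cox / centering / scaling parameters from the training data
pre_proc <- preProcess(df, method = c("BoxCox", "center", "scale"))
pre_proc

# Apply the fitted transformation to both the training data and the evaluation set
df      <- predict(pre_proc, df)
df_eval <- predict(pre_proc, df_eval)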

## Created from 2567 samples and 34 variables
## 
## Pre-processing:
##   - Box-Cox transformation (22)
##   - centered (34)
##   - ignored (0)
##   - scaled (34)
## 
## Lambda estimates for Box-Cox transformation:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.00000 -1.90000 -0.15000 -0.08182  1.15000  2.00000

Here are some plots to demonstrate the changes in distributions before and after the transformations:

As expected, the dummy variables, e.g. `Brand Code`A, are binary, and we still have bimodal features since we didn't apply any feature engineering to them. A few variables still show skew, e.g. PSC Fill and Temperature, but they are closer to normal.

Finalizing the dataset for model building

With our transformations complete, we can now continue on to building our models.

3. Build Models

Using the training data, build at least three different models. Since we have multicollinearity, we should select models that are tolerant of it or perform feature selection.

Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter-intuitive? Why? The boss needs to know.

Model-building methodology

With a solid understanding of our dataset at this point, and with our data cleaned, we can now start to build out candidate models. We will explore …(LR, PLS?, KNN?, SVM?, )

We still need guidance on whether to prioritize explainability or accuracy; if accuracy is the priority, a neural network is probably the better direction. Note that not all models provide a varImp() method for variable importance.

First, we split our cleaned dataset into training and testing sets (80% training, 20% testing). This was necessary because the provided holdout evaluation dataset does not include PH values, so we cannot measure model performance against it.
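A sketch of the split with caret::createDataPartition (the seed is arbitrary; df is the cleaned, transformed data):

library(caret)

set.seed(424)

# Stratified 80/20 split on the target
train_idx <- createDataPartition(df$PH, p = 0.8, list = FALSE)
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]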

Model #1 (Multi-LM)

Using our training dataset, we fit a multiple linear regression model that included the features we retained after the data cleaning process described above.
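A sketch of the fit and the diagnostics reported below (the Call in the output shows the explicit subset of predictors actually used; here we simply regress on all retained columns, and car::vif() produces the VIF table):

library(car)

# Multiple linear regression of PH on the retained predictors
lm_fit <- lm(PH ~ ., data = df_train)
summary(lm_fit)
confint(lm_fit)

# Variance inflation factors to flag multicollinearity
print("VIF scores of predictors")
vif(lm_fit)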

## 
## Call:
## lm(formula = PH ~ `\`Brand Code\`B` + `\`Brand Code\`C` + `\`Brand Code\`D` + 
##     `Carb Volume` + `Fill Ounces` + `PC Volume` + `Carb Temp` + 
##     `PSC Fill` + `PSC CO2` + `Mnf Flow` + `Carb Pressure1` + 
##     `Fill Pressure` + `Hyd Pressure2` + `Hyd Pressure3` + `Filler Level` + 
##     Temperature + `Usage cont` + `Carb Flow` + Balling + `Pressure Vacuum` + 
##     `Oxygen Filler` + `Bowl Setpoint` + `Pressure Setpoint` + 
##     `Air Pressurer` + `Alch Rel` + `Balling Lvl`, data = df_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51348 -0.08048  0.00839  0.08877  0.45361 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)          8.544974   0.002891 2956.025  < 2e-16 ***
## `\\`Brand Code\\`B`  0.040026   0.006761    5.920 3.77e-09 ***
## `\\`Brand Code\\`C` -0.019633   0.004653   -4.219 2.56e-05 ***
## `\\`Brand Code\\`D`  0.022077   0.006761    3.266 0.001111 ** 
## `Carb Volume`       -0.013337   0.005292   -2.520 0.011800 *  
## `Fill Ounces`       -0.005211   0.003058   -1.704 0.088491 .  
## `PC Volume`         -0.008552   0.003413   -2.505 0.012310 *  
## `Carb Temp`          0.004195   0.002972    1.412 0.158232    
## `PSC Fill`          -0.007392   0.002965   -2.493 0.012736 *  
## `PSC CO2`           -0.007066   0.003046   -2.320 0.020447 *  
## `Mnf Flow`          -0.089220   0.006409  -13.921  < 2e-16 ***
## `Carb Pressure1`     0.031882   0.003526    9.041  < 2e-16 ***
## `Fill Pressure`      0.007959   0.004084    1.949 0.051462 .  
## `Hyd Pressure2`     -0.031719   0.008297   -3.823 0.000136 ***
## `Hyd Pressure3`      0.069485   0.009871    7.040 2.63e-12 ***
## `Filler Level`      -0.013279   0.008600   -1.544 0.122753    
## Temperature         -0.019637   0.003389   -5.794 7.94e-09 ***
## `Usage cont`        -0.019739   0.003742   -5.275 1.47e-07 ***
## `Carb Flow`          0.013469   0.003748    3.593 0.000334 ***
## Balling             -0.074856   0.013498   -5.546 3.31e-08 ***
## `Pressure Vacuum`   -0.009644   0.004204   -2.294 0.021888 *  
## `Oxygen Filler`     -0.017094   0.004097   -4.172 3.14e-05 ***
## `Bowl Setpoint`      0.050925   0.008912    5.714 1.27e-08 ***
## `Pressure Setpoint` -0.012537   0.004204   -2.982 0.002895 ** 
## `Air Pressurer`     -0.004546   0.003137   -1.449 0.147391    
## `Alch Rel`           0.033784   0.012701    2.660 0.007876 ** 
## `Balling Lvl`        0.064338   0.014774    4.355 1.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1309 on 2028 degrees of freedom
## Multiple R-squared:  0.4245, Adjusted R-squared:  0.4171 
## F-statistic: 57.53 on 26 and 2028 DF,  p-value: < 2.2e-16
##                             2.5 %        97.5 %
## (Intercept)          8.539305e+00  8.5506426285
## `\\`Brand Code\\`B`  2.676671e-02  0.0532862240
## `\\`Brand Code\\`C` -2.875792e-02 -0.0105073515
## `\\`Brand Code\\`D`  8.818438e-03  0.0353352793
## `Carb Volume`       -2.371499e-02 -0.0029592655
## `Fill Ounces`       -1.120801e-02  0.0007855694
## `PC Volume`         -1.524565e-02 -0.0018576704
## `Carb Temp`         -1.633405e-03  0.0100241565
## `PSC Fill`          -1.320598e-02 -0.0015776534
## `PSC CO2`           -1.303980e-02 -0.0010926876
## `Mnf Flow`          -1.017886e-01 -0.0766506964
## `Carb Pressure1`     2.496595e-02  0.0387971221
## `Fill Pressure`     -5.054988e-05  0.0159693398
## `Hyd Pressure2`     -4.799160e-02 -0.0154468522
## `Hyd Pressure3`      5.012758e-02  0.0888428224
## `Filler Level`      -3.014510e-02  0.0035877963
## Temperature         -2.628276e-02 -0.0129903691
## `Usage cont`        -2.707649e-02 -0.0124007797
## `Carb Flow`          6.118198e-03  0.0208207020
## Balling             -1.013267e-01 -0.0483847881
## `Pressure Vacuum`   -1.788759e-02 -0.0013995870
## `Oxygen Filler`     -2.512812e-02 -0.0090594646
## `Bowl Setpoint`      3.344606e-02  0.0684032641
## `Pressure Setpoint` -2.078094e-02 -0.0042925572
## `Air Pressurer`     -1.069745e-02  0.0016052421
## `Alch Rel`           8.876188e-03  0.0586926219
## `Balling Lvl`        3.536413e-02  0.0933128122

## [1] "VIF scores of predictors"
## `\\`Brand Code\\`B` `\\`Brand Code\\`C` `\\`Brand Code\\`D`       `Carb Volume` 
##            5.480372            2.656890            5.446772            3.293094 
##       `Fill Ounces`         `PC Volume`         `Carb Temp`          `PSC Fill` 
##            1.119128            1.376066            1.061715            1.063613 
##           `PSC CO2`          `Mnf Flow`    `Carb Pressure1`     `Fill Pressure` 
##            1.059114            4.900991            1.507742            2.035490 
##     `Hyd Pressure2`     `Hyd Pressure3`      `Filler Level`         Temperature 
##            8.280903           11.760305            8.810006            1.382046 
##        `Usage cont`         `Carb Flow`             Balling   `Pressure Vacuum` 
##            1.683663            1.704095           21.463402            2.141367 
##     `Oxygen Filler`     `Bowl Setpoint` `Pressure Setpoint`     `Air Pressurer` 
##            1.975616            9.311984            2.110772            1.183983 
##          `Alch Rel`       `Balling Lvl` 
##           19.159384           26.065167

Applying Model 1 against our Test Data:
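A sketch of scoring the model on the held-out test split (caret::postResample reports RMSE, R-squared, and MAE):

# Predict PH for the test split and summarize prediction accuracy
lm_preds <- predict(lm_fit, newdata = df_test)
caret::postResample(pred = lm_preds, obs = df_test$PH)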

INSERT discussion here

Model #2 (PLS?, Ridge?, ENET?)

Applying Model 2 against our Test Data:

Model #3? (Neural Network - Regression)

Insert discussion

Applying Model 3 against our Test Data:

Model #4? (kNN - Classification)

Insert discussion

Applying Model 4 against our Test Data:

Model #5? (XGBoost - Classification?)

Insert discussion

Applying Model 5 against our Test Data:

Model #6? (Kera DL NN)

Insert discussion

Applying Model 6 against our Test Data:

Model Summary

Insert discussion and summary

4. Model Selection & Analysis

For the model, will you use a metric such as log-likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

Insert discussion here

Predictions

We apply Model #N to the holdout evaluation set to predict the targets for these instances. We have saved these predictions in the file eval_predictions.csv.

Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/eval_predictions.csv

References

Appendix

R Code

# Copy final R code here and hide it up above