Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Our team’s analysis seeks to build understanding of the ABC Beverage manufacturing process and the related factors that affect the pH of the company’s beverage products. Our goal is to build a model that both predicts product pH from the manufacturing steps and identifies which steps appear to have the most impact on pH.
We have been provided with historical data for product batches, including data on each manufacturing step along with the final measured pH. We will start by understanding the dataset: specifically, are there any missing data, outliers, or odd feature distributions that might complicate modeling? We will then do any necessary data cleaning and split our data into training and testing sets so we can more accurately determine model performance on out-of-set samples. We will apply a number of different machine learning approaches, touching on several broad prediction families: linear regression, multiple regression, penalized regression, non-linear regression, tree-based models, and neural networks. Different methodologies can perform better depending on the nature of the data, so it makes sense to try a number of approaches and choose the one that best handles our specific dataset. We will then choose the model that performs best and use that to predict final pH on a holdout evaluation dataset.
This model could help the company adapt its processes in a changing regulatory environment.
Note - we are doing an observational study so any correlations we identify would need to be followed up with testing to identify causal relationships.
The training data set contains 32 categorical, continuous, or discrete features and 2571 rows, with 267 rows reserved for an evaluation set that lacks the target. That target is PH, which should be a continuous variable but has 52 distinct values in the training set. As a result, possible predictive models could include regression, classification, or an ensemble of both.
There are two files provided:
- StudentData.xlsx: historical production data, including PH, the feature we seek to predict.
- StudentEvaluation.xlsx: a holdout evaluation set with the PH column left empty. Our model will have to be scored by an outside group with knowledge of the actual pH values.

Note: Both Excel files are in simple CSV format.
# Load the beverage manufacturing datasets
df <- read_excel('datasets/StudentData.xlsx')
df_eval <- read_excel('datasets/StudentEvaluation.xlsx')
# remove the empty PH column from the evaluation data
df_eval <- df_eval %>%
dplyr::select(-PH)

Below is a list of the variables of interest in the data set:
- Brand Code: categorical; values: A, B, C, D
- Carb Volume
- Fill Ounces
- PC Volume
- Carb Pressure
- Carb Temp
- PSC
- PSC Fill
- PSC CO2
- Mnf Flow
- Carb Pressure1
- Fill Pressure
- Hyd Pressure1
- Hyd Pressure2
- Hyd Pressure3
- Hyd Pressure4
- Filler Level
- Filler Speed
- Temperature
- Usage cont
- Carb Flow
- Density
- MFR
- Balling
- Pressure Vacuum
- PH: the TARGET we will try to predict
- Oxygen Filler
- Bowl Setpoint
- Pressure Setpoint
- Air Pressurer
- Alch Rel
- Carb Rel
- Balling Lvl

We compiled summary statistics on our dataset to better understand the data before modeling.
| Name | df |
|---|---|
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
First, across features, there are numerous missing values (coded as NA) that will need to be imputed. Note especially that 4 rows are missing a PH value; we will need to drop these rows as they cannot be used for training. Second, the basic histograms suggest that skewness is prevalent across features, for example PSC CO2 and MFR. And third, some of the skewed features appear to show near-zero variance, with a large number of 0 or even negative values, e.g. Hyd Pressure1 and Hyd Pressure2. In general, the skewness and imbalance may require transformation before modeling.
If our target, PH, is particularly skewed, it could lead to biased predictions.

PH is approximately normally distributed with possible outliers on the low and high ends. This distribution suggests a pure classification approach could be problematic, as predictions may favor pH values in the mid-range (where there are more data points). Natural boundaries may exist such that classification adds predictive information; however, given the roughly normal shape, a regression approach, or possibly an ensemble of regression and classification, seems more appropriate. We also note that models may have more trouble predicting pH values at the extremes: there are fewer observations at the low and high ends of the pH range, which means less information to help a model tune for these regions.
Before continuing, let us better understand any patterns of missingness across predictor features.
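One way to produce such a summary is with the naniar package (a sketch; the chunk that generated the table below is not shown in the rendered report):

library(naniar)

# Tabulate the count and percentage of missing values for every column,
# sorted from most to least missing
miss_var_summary(df)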
| variable | n_miss | pct_miss |
|---|---|---|
| MFR | 212 | 8.2458187 |
| Brand Code | 120 | 4.6674446 |
| Filler Speed | 57 | 2.2170362 |
| PC Volume | 39 | 1.5169195 |
| PSC CO2 | 39 | 1.5169195 |
| Fill Ounces | 38 | 1.4780241 |
| PSC | 33 | 1.2835473 |
| Carb Pressure1 | 32 | 1.2446519 |
| Hyd Pressure4 | 30 | 1.1668611 |
| Carb Pressure | 27 | 1.0501750 |
| Carb Temp | 26 | 1.0112797 |
| PSC Fill | 23 | 0.8945935 |
| Fill Pressure | 22 | 0.8556982 |
| Filler Level | 20 | 0.7779074 |
| Hyd Pressure2 | 15 | 0.5834306 |
| Hyd Pressure3 | 15 | 0.5834306 |
| Temperature | 14 | 0.5445352 |
| Oxygen Filler | 12 | 0.4667445 |
| Pressure Setpoint | 12 | 0.4667445 |
| Hyd Pressure1 | 11 | 0.4278491 |
| Carb Volume | 10 | 0.3889537 |
| Carb Rel | 10 | 0.3889537 |
| Alch Rel | 9 | 0.3500583 |
| Usage cont | 5 | 0.1944769 |
| PH | 4 | 0.1555815 |
| Mnf Flow | 2 | 0.0777907 |
| Carb Flow | 2 | 0.0777907 |
| Bowl Setpoint | 2 | 0.0777907 |
| Density | 1 | 0.0388954 |
| Balling | 1 | 0.0388954 |
| Balling Lvl | 1 | 0.0388954 |
| Pressure Vacuum | 0 | 0.0000000 |
| Air Pressurer | 0 | 0.0000000 |
Notice that approximately 8.25 percent of the rows are missing a value for MFR. We may need to drop this feature, since as missingness increases, so do the potential negative consequences of imputation. Additionally, the categorical feature Brand Code is missing approximately 4.67 percent of its values. Since we do not know whether these represent another brand or are truly missing, we will recode the missing values as a new category, ‘Unknown’. The rest of the features are missing only small percentages of values, suggesting that KNN imputation should be safe.
Next, we visualize the distributions of each of the predictor features. The visuals will help us select features for modeling, assess relationships between features and with PH, and identify outliers as well as transformations that might improve model resolution.
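A sketch of how the distribution plots can be generated, using the same faceted-histogram pattern we reuse after the transformations later in the report (the omitted chunk presumably looked similar):

# Reshape the numeric predictors to long format for faceting
gather_df <- df %>%
  dplyr::select(-c(PH, `Brand Code`)) %>%
  gather(key = 'variable', value = 'value')

# Histogram with a density overlay for each predictor
ggplot(gather_df) +
  geom_histogram(aes(x=value, y = ..density..), bins=30) +
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)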
The distribution profiles show that skewness is prevalent: right skew in Oxygen Filler, PSC, and Temperature, and left skew in Filler Speed and MFR. These deviations from a traditional normal distribution can be problematic for linear regression assumptions, so we may need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. Furthermore, we have a number of bimodal features, e.g. Air Pressurer, Balling, and Balling Lvl.
Bimodal features in a dataset are problematic but interesting, representing areas of potential opportunity and exploration. They suggest the existence of two different groups, or classes, within a given feature. These groups may have separate but overlapping distributions that could carry strong predictive signal in a model.
Were we tackling in-depth feature engineering in this analysis, we could leverage the mixtools package (see its R vignette). This package fits mixture models in which the data can be subdivided into subgroups. We could then add new binary features indicating, for each instance, the distribution to which it belongs.
Here is a quick example showing a possible mix within Air Pressurer:
# Select `Air Pressurer` column and remove any missing data
df_mix <- df %>%
dplyr::select(`Air Pressurer`) %>%
tidyr::drop_na()
# Calculate mixed distributions for `Air Pressurer`
air_pressure_mix <- normalmixEM(df_mix$`Air Pressurer`,
lambda = .5,
mu = c(140, 148),
sigma = 1,
maxit=60)

## number of iterations= 7
# Simple plot to illustrate possible bimodal mix of groups
plot(air_pressure_mix,
whichplots = 2,
density = TRUE,
main2 = "`Air Pressurer` Possible Distributions",
xlab2 = "Air Pressurer")Lastly, several features have relatively normal distributions along with high numbers of values at an extreme. We have no information on whether these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will need to review each associated feature to determine whether to impute the values, leave them as is, or apply feature engineering.
We also elected to use boxplots to understand the spread of each feature.
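A sketch of the boxplot chunk, assuming the same long-format reshaping as above:

# Boxplots of each numeric feature to inspect spread and outliers
df %>%
  dplyr::select(-`Brand Code`) %>%
  gather(key = 'variable', value = 'value') %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = 'red', outlier.size = 0.5) +
  facet_wrap(. ~variable, scales = 'free', ncol = 4) +
  labs(x = NULL)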
The boxplots reveal outliers, though none of them seem egregious enough to warrant imputing or removal. Outliers should only be imputed or dropped if we have reason to believe they are errant or contain no critical information.
Next, we generate scatter plots of each predictor versus the target to get an idea of the relationship between them.
The scatter plots indicate some clear relationships between our target and predictor features, such as PH and Oxygen Filler or PH and Alch Rel. However, we also see clear correlations between some of the predictors, like Carb Temp and Carb Pressure. Overall, although our plots indicate some interesting relationships, they also underline the aforementioned possible issues with the data. For instance, many predictors have skewed distributions, and in some cases, missing data may be recorded as ‘0’.
We next quantify the relationships visualized above. In general, our model should focus on features showing stronger positive or negative correlations with PH. Features with correlations closer to zero will probably not provide any meaningful information on pH levels.
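A sketch of the correlation calculation behind the table below, assuming pairwise-complete observations:

# Correlation of each numeric feature with PH, ignoring missing values pairwise
num_df <- df %>% dplyr::select(-`Brand Code`)
ph_cors <- cor(num_df, use = 'pairwise.complete.obs')[, 'PH']

# Drop PH itself and sort from strongest positive to strongest negative correlation
ph_cors <- sort(ph_cors[names(ph_cors) != 'PH'], decreasing = TRUE)
data.frame(values = ph_cors, ind = names(ph_cors))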
## values ind
## 1 0.361587534 Bowl Setpoint
## 2 0.352043962 Filler Level
## 3 0.233593699 Carb Flow
## 4 0.219735497 Pressure Vacuum
## 5 0.196051481 Carb Rel
## 6 0.166682228 Alch Rel
## 7 0.164485364 Oxygen Filler
## 8 0.109371168 Balling Lvl
## 9 0.098866734 PC Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb Pressure
## 13 0.072132509 Carb Volume
## 14 0.032279368 Carb Temp
## 15 -0.007997231 Air Pressurer
## 16 -0.023809796 PSC Fill
## 17 -0.040882953 Filler Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC CO2
## 22 -0.118335902 Fill Ounces
## 23 -0.118764185 Carb Pressure1
## 24 -0.171434026 Hyd Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd Pressure2
## 27 -0.268101792 Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463 Fill Pressure
## 30 -0.357611993 Usage cont
## 31 -0.459231253 Mnf Flow
It appears that Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations. The remaining features have weak correlations, which implies they carry less predictive power.
One problem that can occur with multiple regression is a correlation between predictive features, or multicollinearity. A quick check is to run correlations between all predictors.
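A sketch of the predictor correlation matrix, using the corrplot package (an assumption; the chunk and figure are not shown here):

library(corrplot)

# Pairwise correlations between the numeric predictors
pred_cors <- df %>%
  dplyr::select(-c(PH, `Brand Code`)) %>%
  cor(use = 'pairwise.complete.obs')

# Visualize the upper triangle of the correlation matrix
corrplot(pred_cors, method = 'color', type = 'upper', tl.cex = 0.6)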
We can see that some variables are highly correlated with one another; for example, Balling Lvl correlates with Carb Volume, Carb Rel, Alch Rel, Density, and Balling at between 0.75 and 1. When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.
As a note, this dataset is challenging as many of the predictive features go hand-in-hand with other features and multicollinearity will be a problem.
Lastly, we want to check for any features that show near-zero variance. Features that take the same value across most of the instances will add little predictive information.
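A sketch of this check using caret’s nearZeroVar:

# Flag features with near-zero variance using caret
nzv <- nearZeroVar(df, saveMetrics = TRUE)

# Show only the flagged features
nzv[nzv$nzv, ]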
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31.11111 9.529366 FALSE TRUE
Hyd Pressure1 displays near-zero variance. We will drop this feature prior to modeling.
To summarize our data preparation and exploration, we distinguish our findings into a few categories below.
- MFR has more than 8% missing values - remove this feature.
- Hyd Pressure1 shows little variance - remove this feature.
- 4 rows are missing PH and need to be removed.
- Replace missing values of Brand Code with “Unknown”.
- Impute remaining missing values in the training data using kNN() from the VIM package.
- Impute remaining missing values in the evaluation data using kNN() from the VIM package.

set.seed(181)
# drop rows with missing PH
df_clean <- df_clean %>%
filter(!is.na(PH))
# Change Brand Code missing to 'Unknown' in our training data
brand_code <- df_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_clean$`Brand Code` <- brand_code$`Brand Code`
# Change Brand Code missing to 'Unknown' in our evaluation data
brand_code <- df_eval_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_eval_clean$`Brand Code` <- brand_code$`Brand Code`
# There is an edge case where our Eval data might have a `Brand Code` not seen in our training set.
# If so, let's convert them to 'Unknown'. This is appropriate since any model trained without the
# new value wouldn't be able to glean any info from it.
codes <- unique(df_clean$`Brand Code`)
df_eval_clean <- df_eval_clean %>%
mutate(`Brand Code` = if_else(`Brand Code` %in% codes, `Brand Code`, 'Unknown'))
# Use the kNN imputing method from VIM package to impute missing values in our training data
df_clean <- df_clean %>%
kNN(k=10) %>%
dplyr::select(colnames(df_clean))
# Use the kNN imputing method from VIM package to impute missing values in our evaluation data
df_eval_clean <- df_eval_clean %>%
kNN(k=10) %>%
dplyr::select(colnames(df_eval_clean))

We do not drop any outliers given all values seem reasonable.
Brand Code is a categorical variable with values A, B, C, D and Unknown. We convert it to a set of dummy columns for modeling.
# -----
# Training data - Convert our `Brand Code` column into a set of dummy variables
df_clean_dummy <- dummyVars(PH ~ `Brand Code`, data = df_clean)
dummies <- predict(df_clean_dummy, df_clean)
# Get the dummy column names
dummy_cols <- sort(colnames(dummies))
# Make sure the new dummy columns are sorted in alpha order (to make sure our columns will match the eval dataset)
dummies <- as.tibble(dummies) %>%
dplyr::select(dummy_cols)
# remove the original categorical feature
df_clean <- df_clean %>%
dplyr::select(-`Brand Code`)
# add the new dummy columns to our main training dataframe
df_clean <- cbind(dummies, df_clean)
# -----
# Evaluation data - Convert our `Brand Code` column into a set of dummy variables
#df_eval_clean <- dummyVars(PH ~ `Brand Code`, data = df_eval_clean)
df_eval_clean$PH <- 1
eval_dummies <- predict(df_clean_dummy, df_eval_clean)
# Edge Case - if the eval dataset doesn't have a specific `Brand Code`
# we will be missing the necessary dummy column. Let's check and if necessary add
# appropriate dummy columns with all 0's.
for (c in dummy_cols) {
if (!(c %in% colnames(eval_dummies))) {
eval_dummies[c] <- 0
}
}
# Now sort the eval_dummy columns so they match the training set dummies
eval_dummy_cols <- sort(colnames(eval_dummies))
eval_dummies <- as.tibble(eval_dummies) %>%
dplyr::select(eval_dummy_cols)
# remove the original categorical feature
df_eval_clean <- df_eval_clean %>%
dplyr::select(-c(`Brand Code`, PH))
# add the new dummy columns to our main eval dataframe
df_eval_clean <- cbind(eval_dummies, df_eval_clean)

Finally, as noted in our data exploration and histogram plots, some of our features are highly skewed. To address this skewness, we scale, center, and apply the Box-Cox transformation to the skewed features using preProcess from caret. These transformations should produce distributions that better approximate normal and thus facilitate modeling.
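The transformation step looks roughly like the following sketch, assuming preProcess is fit on the predictor columns (with PH carried through unchanged) and applied to both the training and evaluation data; printing the fitted object yields the summary below:

# Fit Box-Cox, centering, and scaling on the predictors (PH excluded)
preproc <- preProcess(dplyr::select(df_clean, -PH),
                      method = c("BoxCox", "center", "scale"))

# Apply the transformations and re-attach the untouched target
df_transformed <- predict(preproc, dplyr::select(df_clean, -PH))
df_transformed$PH <- df_clean$PH

# Apply the same transformations to the evaluation data
df_eval_transformed <- predict(preproc, df_eval_clean)

preproc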
## Created from 2567 samples and 34 variables
##
## Pre-processing:
## - Box-Cox transformation (22)
## - centered (34)
## - ignored (0)
## - scaled (34)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -1.90000 -0.15000 -0.08182 1.15000 2.00000
Here are some plots to demonstrate the changes in distributions after the transformations:
# Prepare data for ggplot
gather_df <- df_transformed %>%
dplyr::select(-c(PH)) %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)

As expected, the dummy variables, e.g. `Brand Code`A, appear binary. We still have bimodal features, since we did not apply any feature engineering to address them. A few features, including PSC Fill and Temperature, still show skew, but they are closer to normal. Our transformations are complete, and we can continue on to building our models.
With a solid understanding of our dataset, and with our data cleaned, we can now start to build candidate models. First, we split our cleaned dataset into training and testing sets (80% training, 20% testing). This split is necessary because the provided evaluation dataset does not include PH values, meaning we cannot measure model performance against it.
set.seed(181)
# utilizing one dataset for all four models
training_set <- createDataPartition(df_transformed$PH, p=0.8, list=FALSE)
df_train <- df_transformed[training_set,]
df_test <- df_transformed[-training_set,]

Using our training dataset, we build a multiple linear regression model that initially regresses PH on all of the features not removed during data preparation. We then use a stepwise process to home in on only the most significant features.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# Model 1 - Build Multi-Linear Regression
model1_raw <- lm(PH ~ ., df_train)
# Build model 1 - only significant features (using stepAIC)
model1 <- stepAIC(model1_raw, direction = "both",
scope = list(upper = model1_raw, lower = ~ 1),
scale = 0, trace = FALSE)
stopCluster(cl)
# Display Model 1 Summary
(lmf_s <- summary(model1))

##
## Call:
## lm(formula = PH ~ `\`Brand Code\`A` + `\`Brand Code\`B` + `\`Brand Code\`C` +
## `Fill Ounces` + `PC Volume` + PSC + `PSC CO2` + `Mnf Flow` +
## `Carb Pressure1` + `Fill Pressure` + `Hyd Pressure2` + `Hyd Pressure3` +
## `Filler Level` + Temperature + `Usage cont` + `Carb Flow` +
## Density + Balling + `Pressure Vacuum` + `Oxygen Filler` +
## `Bowl Setpoint` + `Pressure Setpoint` + `Alch Rel` + `Balling Lvl`,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50945 -0.07982 0.00982 0.08504 0.43065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.547150 0.002915 2932.148 < 2e-16 ***
## `\\`Brand Code\\`A` -0.016743 0.004227 -3.961 7.73e-05 ***
## `\\`Brand Code\\`B` 0.029783 0.006849 4.348 1.44e-05 ***
## `\\`Brand Code\\`C` -0.024726 0.004844 -5.104 3.63e-07 ***
## `Fill Ounces` -0.005894 0.003064 -1.924 0.054533 .
## `PC Volume` -0.008119 0.003577 -2.270 0.023313 *
## PSC -0.006766 0.003069 -2.205 0.027584 *
## `PSC CO2` -0.006481 0.002972 -2.180 0.029353 *
## `Mnf Flow` -0.086780 0.006344 -13.679 < 2e-16 ***
## `Carb Pressure1` 0.030781 0.003577 8.605 < 2e-16 ***
## `Fill Pressure` 0.011077 0.004300 2.576 0.010066 *
## `Hyd Pressure2` -0.023931 0.008542 -2.802 0.005131 **
## `Hyd Pressure3` 0.062535 0.010064 6.214 6.25e-10 ***
## `Filler Level` -0.016405 0.009345 -1.755 0.079332 .
## Temperature -0.017323 0.003320 -5.217 2.00e-07 ***
## `Usage cont` -0.019289 0.003768 -5.120 3.35e-07 ***
## `Carb Flow` 0.009855 0.003864 2.551 0.010824 *
## Density -0.020765 0.009086 -2.285 0.022401 *
## Balling -0.088497 0.014300 -6.189 7.32e-10 ***
## `Pressure Vacuum` -0.014525 0.004279 -3.394 0.000702 ***
## `Oxygen Filler` -0.011546 0.004079 -2.830 0.004694 **
## `Bowl Setpoint` 0.053075 0.009502 5.586 2.64e-08 ***
## `Pressure Setpoint` -0.017965 0.004387 -4.095 4.38e-05 ***
## `Alch Rel` 0.027232 0.011035 2.468 0.013678 *
## `Balling Lvl` 0.107707 0.016128 6.678 3.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1319 on 2030 degrees of freedom
## Multiple R-squared: 0.4153, Adjusted R-squared: 0.4084
## F-statistic: 60.08 on 24 and 2030 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 8.541432972 8.5528662976
## `\\`Brand Code\\`A` -0.025032457 -0.0084526361
## `\\`Brand Code\\`B` 0.016350785 0.0432148570
## `\\`Brand Code\\`C` -0.034225499 -0.0152255359
## `Fill Ounces` -0.011902406 0.0001147270
## `PC Volume` -0.015132935 -0.0011046087
## PSC -0.012783804 -0.0007475681
## `PSC CO2` -0.012309755 -0.0006513135
## `Mnf Flow` -0.099222383 -0.0743385016
## `Carb Pressure1` 0.023765704 0.0377968490
## `Fill Pressure` 0.002643829 0.0195097510
## `Hyd Pressure2` -0.040682619 -0.0071802953
## `Hyd Pressure3` 0.042798604 0.0822704157
## `Filler Level` -0.034731970 0.0019220412
## Temperature -0.023834539 -0.0108108559
## `Usage cont` -0.026677337 -0.0119001011
## `Carb Flow` 0.002277800 0.0174320499
## Density -0.038584542 -0.0029450834
## Balling -0.116541117 -0.0604526406
## `Pressure Vacuum` -0.022916789 -0.0061323473
## `Oxygen Filler` -0.019546262 -0.0035462859
## `Bowl Setpoint` 0.034440071 0.0717092181
## `Pressure Setpoint` -0.026567991 -0.0093622248
## `Alch Rel` 0.005590770 0.0488734326
## `Balling Lvl` 0.076077342 0.1393359230
# Display variable importance plot
variableImportancePlot(model1, "Model 1 LM Variable Importance")

## [1] "VIF scores of predictors"
## `\\`Brand Code\\`A` `\\`Brand Code\\`B` `\\`Brand Code\\`C` `Fill Ounces`
## 2.112206 5.538897 2.716281 1.101626
## `PC Volume` PSC `PSC CO2` `Mnf Flow`
## 1.479643 1.103823 1.030926 4.760599
## `Carb Pressure1` `Fill Pressure` `Hyd Pressure2` `Hyd Pressure3`
## 1.525962 2.187856 8.628236 12.049196
## `Filler Level` Temperature `Usage cont` `Carb Flow`
## 10.325127 1.340368 1.690024 1.803029
## Density Balling `Pressure Vacuum` `Oxygen Filler`
## 9.766020 24.149939 2.122035 1.981706
## `Bowl Setpoint` `Pressure Setpoint` `Alch Rel` `Balling Lvl`
## 10.724204 2.281632 14.349801 30.617008
Applying Model 1 against our Test Data:
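The chunk that scores Model 1 on the test set is not shown in the rendered output; it follows the same pattern as the later models and initializes the shared results table (a sketch):

# Predict df_test and calculate performance
lm_pred <- predict(model1, newdata = df_test)
results <- data.frame(t(postResample(pred = lm_pred, obs = df_test$PH))) %>%
  mutate(Model = "Multiple Regression")
(results)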
We employ ridge regression for our second model. Ridge regression uses shrinkage to penalize coefficient estimates in an effort to control the inflation that multicollinearity can cause. Given the noted presence of high correlations across features in this dataset, ridge seems like an appropriate modeling approach.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# to find the right lambda using cv.glmnet
x_train <- model.matrix(PH ~ ., data = df_train)
x_test <- model.matrix(PH ~ ., data = df_test)
cv.glmnet <- cv.glmnet(x_train, df_train$PH, alpha = 0)
ridge_model <- glmnet(x_train, df_train$PH, alpha = 0, lambda = cv.glmnet$lambda.min)
stopCluster(cl)
summary(ridge_model)

## Length Class Mode
## a0 1 -none- numeric
## beta 35 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
Applying Model 2 against our Test Data:
# Predict df_test and calculate performance
pred_ridge <- predict(ridge_model, x_test)
results <- data.frame(t(postResample(pred = pred_ridge, obs = df_test$PH))) %>%
mutate(Model = "Ridge Regression") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1319699 0.4491237 0.1033111 Ridge Regression
## 2 0.1318883 0.4439717 0.1028018 Mutiple Regression
Next, we build an elastic net model. Elastic net combines two types of penalties: the squared-magnitude shrinkage penalty used by ridge regression and the absolute-value penalty used by the lasso (least absolute shrinkage and selection operator). Here again, elastic net could help address some of the shortcomings of the dataset.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# training the elastic net regression model using train
elas_net_model <- train(
PH ~ ., data = df_train, method = "glmnet",
trControl = trainControl("repeatedcv", repeats = 8),
tuneLength = 4
)
stopCluster(cl)
summary(elas_net_model)

## Length Class Mode
## a0 97 -none- numeric
## beta 3298 dgCMatrix S4
## df 97 -none- numeric
## dim 2 -none- numeric
## lambda 97 -none- numeric
## dev.ratio 97 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
Applying Model 3 against our Test Data:
# Make predictions
elas_net_pred <- predict(elas_net_model, df_test)
# Model performance metrics
results <- data.frame(t(postResample(pred=elas_net_pred, obs=df_test$PH))) %>%
mutate(Model = "ElasticNet Regression") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1313219 0.4500410 0.1027585 ElasticNet Regression
## 2 0.1319699 0.4491237 0.1033111 Ridge Regression
## 3 0.1318883 0.4439717 0.1028018 Mutiple Regression
Our fourth model averages results from several neural network models. Generally, a neural network models an outcome, e.g. PH, using a set of unobserved (hidden) units. Specifically, we use the avNNet method from caret, which aggregates a group of neural networks by averaging their predictions; averaging can have a strong positive effect on model performance.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# using method avNNET from train, which is to aggregate several neural network models by averaging
nnetGrid <- expand.grid(.decay = c(0.1, 0.5), .size = c(1,10), .bag = FALSE)
nnet.model <- train(PH ~ ., data = df_train, method = "avNNet",
                    preProcess = c("center", "scale"),
                    tuneGrid = nnetGrid,
                    trControl = trainControl(method = "repeatedcv", repeats = 1),
                    trace = FALSE, linout = TRUE, maxit = 500)
stopCluster(cl)
summary(nnet.model)

## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 34 -none- character
## terms 3 terms call
## coefnames 34 -none- character
## xlevels 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 3 -none- list
Applying Model 4 against our Test Data:
# Predict df_test and calculate performance
nnet_pred <- predict(nnet.model, newdata = df_test)
results <- data.frame(t(postResample(pred = nnet_pred, obs = df_test$PH))) %>%
mutate(Model = "Neural Network (avNNET - Modeling Averaging)") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 2 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 3 0.1319699 0.4491237 0.10331111 Ridge Regression
## 4 0.1318883 0.4439717 0.10280182 Mutiple Regression
Our fifth model uses a k-nearest neighbors (KNN) approach to predict PH. Each instance’s predicted value is the average of the K nearest points in the training data. KNN is an intuitive approach that can predict well when instances with similar predictor values tend to have similar responses.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
knnModel <- train(PH ~ ., data = df_train, method = "knn", preProc = c("center","scale"), tuneLength = 10)
stopCluster(cl)
summary(knnModel)

## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
Applying Model 5 against our Test Data:
# Predict df_test and calculate performance
knnPred <- predict(knnModel, newdata = df_test)
results <- data.frame(t(postResample(pred = knnPred, obs = df_test$PH))) %>%
mutate(Model = "k-Nearest Neighbors(kNN)") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1209728 0.5364383 0.09312178 k-Nearest Neighbors(kNN)
## 2 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 3 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 4 0.1319699 0.4491237 0.10331111 Ridge Regression
## 5 0.1318883 0.4439717 0.10280182 Mutiple Regression
Generalized boosted models (GBMs) encompass both regression and classification. A GBM combines a loss function with a weak learner to build an additive model that minimizes that loss function.
df_transformed1 <- df_transformed %>% dplyr::select (-PH)
X.train <- df_transformed1[training_set, ]
y.train <- df_transformed$PH[training_set]
X.test <- df_transformed1[-training_set, ]
y.test <- df_transformed$PH[-training_set]
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
interaction.depth = c(1, 5, 10, 15),
shrinkage = c(0.01, 0.1, 0.5),
n.minobsinnode = c(5, 10, 15))
gbm_Model <- train(x = X.train,
y = y.train,
method = "gbm",
tuneGrid = grid,
verbose = FALSE # turn off the status of training process
)
# df_train_data <- as.matrix(df_train[-35])
# df_train_label <- df_train[[35]]
# dtrain <- xgb.DMatrix(data = df_train_data, label = df_train_label)
# ###
# df_test_data <- as.matrix(df_test[-35])
# df_test_label <- df_test[[35]]
# dtest <- xgb.DMatrix(data = df_test_data, label = df_test_label)
# ###
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# parametersGrid <- expand.grid(eta = 0.2,
# colsample_bytree = 1,
# max_depth = c(11, 12, 13, 14),
# subsample = 1,
# gamma = 0,
# min_child_weight = 1,
# nrounds = 25
# )
# controls <- trainControl(method = "cv", number = 10)
# xgb1 <- train(PH ~ .,
# data = df_train,
# method = "xgbTree",
# objective = "reg:squarederror",
# trControl = controls,
# tuneGrid = parametersGrid
# )
# xgb1
#
stopCluster(cl)
summary(gbm_Model)

## var rel.inf
## Mnf Flow Mnf Flow 17.1233242
## Oxygen Filler Oxygen Filler 5.9016356
## `Brand Code`C `Brand Code`C 5.3820653
## Pressure Vacuum Pressure Vacuum 5.0475294
## Usage cont Usage cont 4.8384888
## Filler Speed Filler Speed 4.5838305
## Alch Rel Alch Rel 4.4272881
## Carb Pressure1 Carb Pressure1 4.2104804
## Temperature Temperature 3.7351105
## Carb Rel Carb Rel 3.6977023
## PC Volume PC Volume 3.1142509
## Air Pressurer Air Pressurer 3.0409386
## Density Density 2.8197077
## Carb Flow Carb Flow 2.7918477
## Balling Lvl Balling Lvl 2.7413468
## Balling Balling 2.3065931
## Hyd Pressure2 Hyd Pressure2 2.2308897
## Fill Ounces Fill Ounces 2.1079970
## Hyd Pressure3 Hyd Pressure3 2.0923701
## Filler Level Filler Level 2.0279712
## Bowl Setpoint Bowl Setpoint 1.9691094
## Fill Pressure Fill Pressure 1.9040087
## PSC PSC 1.8429000
## PSC Fill PSC Fill 1.6105568
## Hyd Pressure4 Hyd Pressure4 1.5655389
## Carb Pressure Carb Pressure 1.5375453
## Carb Volume Carb Volume 1.3666573
## Carb Temp Carb Temp 1.0402123
## PSC CO2 PSC CO2 0.9311092
## `Brand Code`A `Brand Code`A 0.9237373
## `Brand Code`B `Brand Code`B 0.5347886
## Pressure Setpoint Pressure Setpoint 0.2776452
## `Brand Code`Unknown `Brand Code`Unknown 0.2748229
## `Brand Code`D `Brand Code`D 0.0000000
## n.trees interaction.depth shrinkage n.minobsinnode
## 92 200 15 0.1 10
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 34 predictors of which 33 had non-zero influence.
Applying Model 6 against our Test Data:
# Predict df_test and calculate performance
gbmPred <- predict(gbm_Model, newdata = df_test)
# y_pred <- predict(xgb, data.matrix(X.test[,-1]))
results <- data.frame(t(postResample(pred = gbmPred, obs = df_test$PH))) %>% mutate(Model = "Generalized Boosted Models") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1008262 0.6745763 0.07589511 Generalized Boosted Models
## 2 0.1209728 0.5364383 0.09312178 k-Nearest Neighbors(kNN)
## 3 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 4 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 5 0.1319699 0.4491237 0.10331111 Ridge Regression
## 6 0.1318883 0.4439717 0.10280182 Mutiple Regression
The approach used for our seventh model, multivariate adaptive regression splines (MARS), creates contrasting versions of each predictor to enter the model. These versions, piecewise features known as hinge functions, each cover an exclusive portion of a predictor’s range. Such features are created iteratively for all model predictors, and the process is followed by “pruning” of individual features that do not contribute to the model.
options(max.print = 1e+06)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
mars.model <- train(x = X.train, y = y.train, method = "earth", tuneGrid = mars.grid,
preProcess = c("center", "scale"), tuneLength = 10)## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
## Loading required package: TeachingDemos
## Call: earth(x=data.frame[2055,34], y=c(8.36,8.26,8.9...), keepxy=TRUE,
## degree=2, nprune=13)
##
## coefficients
## (Intercept) 8.3973280
## h(-0.202251-Mnf Flow) 0.1879260
## h(-0.323171-Carb Pressure1) -0.0421331
## h(2.01988-Temperature) 0.0327214
## h(Alch Rel-0.648866) 0.1088430
## \\Brand Code\\C * h(Hyd Pressure3-0.114264) -0.0348800
## \\Brand Code\\C * h(0.114264-Hyd Pressure3) -0.0538548
## h(Mnf Flow- -0.202251) * h(Usage cont- -0.255382) -0.0409204
## h(-0.202251-Mnf Flow) * h(Pressure Vacuum-0.380367) -0.0767888
## h(-0.202251-Mnf Flow) * h(0.380367-Pressure Vacuum) -0.0830172
## h(-0.202251-Mnf Flow) * h(Air Pressurer-0.819609) -0.0566259
## h(-0.571388-Balling) * h(0.648866-Alch Rel) 0.0792396
## h(Bowl Setpoint- -1.30829) * h(0.648866-Alch Rel) 0.0305173
##
## Selected 13 of 33 terms, and 11 of 34 predictors (nprune=13)
## Termination condition: RSq changed by less than 0.001 at 33 terms
## Importance: `Mnf Flow`, `\`Brand Code\`C`, `Hyd Pressure3`, `Alch Rel`, ...
## Number of terms at each degree of interaction: 1 4 8
## GCV 0.01626822 RSS 32.43017 GRSq 0.4473468 RSq 0.4633726
Applying Model 7 against our Test Data:
# Predict df_test and calculate performance
mars_pred <- predict(mars.model, newdata = X.test)
results <- data.frame(t(postResample(pred = mars_pred, obs = y.test))) %>% mutate(Model = "Multivariate Adaptive Regression Splines (MARS)") %>% rbind(results)
(results)

## RMSE Rsquared MAE
## 1 0.1280888 0.4765482 0.09951400
## 2 0.1008262 0.6745763 0.07589511
## 3 0.1209728 0.5364383 0.09312178
## 4 0.1069318 0.6330649 0.08075941
## 5 0.1313219 0.4500410 0.10275847
## 6 0.1319699 0.4491237 0.10331111
## 7 0.1318883 0.4439717 0.10280182
## Model
## 1 Multivariate Adaptive Regression Splines (MARS)
## 2 Generalized Boosted Models
## 3 k-Nearest Neighbors(kNN)
## 4 Neural Network (avNNET - Modeling Averaging)
## 5 ElasticNet Regression
## 6 Ridge Regression
## 7 Mutiple Regression
Our eighth and final candidate is a Cubist model. Only recently made open source, Cubist combines a variety of tree-based and rule-based techniques, including smoothing, rule-making, pruning, “boosting” through committees, and adjusting predictions using nearby training points.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
cubist_Model <- train(x = X.train, y = y.train, method = "cubist")
stopCluster(cl)
# summary(cubist_Model)

Applying Model 8 against our Test Data:
# Predict df_test and calculate performance
Cubist.pred <- predict(cubist_Model, newdata = X.test)
results <- data.frame(t(postResample(pred = Cubist.pred, obs = y.test))) %>% mutate(Model = "Cubist Model") %>% rbind(results)
(results)

## RMSE Rsquared MAE
## 1 0.0937689 0.7193900 0.06879833
## 2 0.1280888 0.4765482 0.09951400
## 3 0.1008262 0.6745763 0.07589511
## 4 0.1209728 0.5364383 0.09312178
## 5 0.1069318 0.6330649 0.08075941
## 6 0.1313219 0.4500410 0.10275847
## 7 0.1319699 0.4491237 0.10331111
## 8 0.1318883 0.4439717 0.10280182
## Model
## 1 Cubist Model
## 2 Multivariate Adaptive Regression Splines (MARS)
## 3 Generalized Boosted Models
## 4 k-Nearest Neighbors(kNN)
## 5 Neural Network (avNNET - Modeling Averaging)
## 6 ElasticNet Regression
## 7 Ridge Regression
## 8 Mutiple Regression
We evaluate our eight models using three criteria: root mean squared error (RMSE), R-squared, and mean absolute error (MAE). The table below lists these criteria for each model.
## Model RMSE Rsquared
## 1 Cubist Model 0.0937689 0.7193900
## 2 Multivariate Adaptive Regression Splines (MARS) 0.1280888 0.4765482
## 3 Generalized Boosted Models 0.1008262 0.6745763
## 4 k-Nearest Neighbors(kNN) 0.1209728 0.5364383
## 5 Neural Network (avNNET - Modeling Averaging) 0.1069318 0.6330649
## 6 ElasticNet Regression 0.1313219 0.4500410
## 7 Ridge Regression 0.1319699 0.4491237
## 8 Mutiple Regression 0.1318883 0.4439717
## MAE
## 1 0.06879833
## 2 0.09951400
## 3 0.07589511
## 4 0.09312178
## 5 0.08075941
## 6 0.10275847
## 7 0.10331111
## 8 0.10280182
Based on RMSE and \(R^2\), both the Cubist and generalized boosted model (GBM) outperformed the other models. This is not surprising, as these models are more tolerant of multicollinearity and better capture non-linear relationships. While Cubist is known for running a little faster, we selected GBM as our final model. Cubist generates a complex rule structure that, while it can give higher accuracy, might not retain that accuracy when faced with new data that doesn’t conform to the original structure. We felt GBM would generalize better, and in a manufacturing setting with unknowns, it is the more conservative choice. GBM also lends itself to clearer feature importance, an important consideration since we want an explainable model.
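The variable importance ranking below comes from applying caret’s varImp to the fitted GBM (a sketch of the call):

# Rank features by their relative influence in the selected GBM
varImp(gbm_Model)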
## gbm variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## Mnf Flow 100.00
## Oxygen Filler 34.47
## `Brand Code`C 31.43
## Pressure Vacuum 29.48
## Usage cont 28.26
## Filler Speed 26.77
## Alch Rel 25.86
## Carb Pressure1 24.59
## Temperature 21.81
## Carb Rel 21.59
## PC Volume 18.19
## Air Pressurer 17.76
## Density 16.47
## Carb Flow 16.30
## Balling Lvl 16.01
## Balling 13.47
## Hyd Pressure2 13.03
## Fill Ounces 12.31
## Hyd Pressure3 12.22
## Filler Level 11.84
We apply Model #6 (GBM) to the holdout evaluation set to predict PH for these instances. We have saved these predictions in CSV format in the file eval_predictions.csv.
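A sketch of this final step, assuming the evaluation data received the same preprocessing as the training data (the df_eval_transformed frame from the earlier preprocessing sketch):

# Predict PH for the holdout evaluation set with the selected GBM
eval_pred <- predict(gbm_Model, newdata = df_eval_transformed)

# Attach the predictions and write them to an Excel-readable CSV
df_eval_out <- df_eval %>% mutate(PH = eval_pred)
write.csv(df_eval_out, 'eval_predictions.csv', row.names = FALSE)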
Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/eval_predictions.csv