Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
Our team’s analysis seeks to build understanding of the ABC Beverage manufacturing process and the related factors that affect the pH of the company’s beverage products. Our goal is to build a model that both predicts product pH from the manufacturing steps and identifies which steps appear to have the most impact on pH.
We have been provided with historical data for product batches, including data on each manufacturing step along with the final measured pH. We will start by understanding the dataset: specifically, are there any missing data, outliers, or odd feature distributions that might complicate modeling? We will then do any necessary data cleaning and split our data into training and testing sets so we can more accurately determine model performance on out-of-set samples. We will apply a number of different machine learning approaches, touching on several broad prediction families: linear regression, multiple regression, penalized regression, non-linear regression, tree-based models, and neural networks. Different methodologies can perform better depending on the nature of the data, so it makes sense to try a number of approaches and choose the one that best handles our specific dataset. We will then choose the model that performs best and use that to predict final pH on a holdout evaluation dataset.
This model could help the company adapt its processes in a changing regulatory environment.
Note - we are doing an observational study so any correlations we identify would need to be followed up with testing to identify causal relationships.
The training data set contains 32 categorical, continuous, or discrete features and 2571 rows, with 267 rows reserved for an evaluation set that lacks the target. That target is PH, which should be a continuous variable but has 52 distinct values in the training set. As a result, possible predictive models could include regression, classification, or an ensemble of both.
There are two files provided:
- StudentData.xlsx: historical production data, including PH, the feature we seek to predict.
- StudentEvaluation.xlsx: a holdout evaluation set with the PH column left empty. Our model will have to be scored by an outside group with knowledge of the actual pH values.

Note: Both Excel files are in simple CSV format.
# Load the beverage manufacturing datasets
df <- read_excel('datasets/StudentData.xlsx')
df_eval <- read_excel('datasets/StudentEvaluation.xlsx')
# remove the empty PH column from the evaluation data
df_eval <- df_eval %>%
dplyr::select(-PH)

Below is a list of the variables of interest in the data set:
- Brand Code: categorical; values: A, B, C, D
- Carb Volume
- Fill Ounces
- PC Volume
- Carb Pressure
- Carb Temp
- PSC
- PSC Fill
- PSC CO2
- Mnf Flow
- Carb Pressure1
- Fill Pressure
- Hyd Pressure1
- Hyd Pressure2
- Hyd Pressure3
- Hyd Pressure4
- Filler Level
- Filler Speed
- Temperature
- Usage cont
- Carb Flow
- Density
- MFR
- Balling
- Pressure Vacuum
- PH: the TARGET we will try to predict
- Oxygen Filler
- Bowl Setpoint
- Pressure Setpoint
- Air Pressurer
- Alch Rel
- Carb Rel
- Balling Lvl

We compiled summary statistics on our dataset to better understand the data before modeling.
| Name | df |
|---|---|
| Number of rows | 2571 |
| Number of columns | 33 |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
First, across features, there are numerous missing values (coded as NA) that will need to be imputed. Note especially that 4 rows are missing a PH value; we will need to drop these rows as they cannot be used for training. Second, the basic histograms suggest that skewness is prevalent across features, for example PSC CO2 and MFR. And third, some of the skewed features appear to show near-zero variance, with a large number of 0 or even negative values, e.g. Hyd Pressure1 and Hyd Pressure2. In general, the skewness and imbalance may require transformation before modeling.
If our target, PH, is particularly skewed, it could lead to biased predictions.

PH is approximately normally distributed with possible outliers on the low and high ends. This distribution suggests a pure classification approach could be problematic, as predictions may favor pH values in the mid-range (where there are more data points). Natural boundaries may exist such that classification adds predictive information; however, given the roughly normal shape, a regression approach, or possibly an ensemble of regression and classification, seems more appropriate. We also note that models may have more trouble predicting pH values at the extremes: there are fewer observations at the low and high ends of the pH range, which means less information to help a model tune for these regions.
Before continuing, let us better understand any patterns of missingness across predictor features.
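One way to produce such a summary is with the naniar package (a sketch; the chunk that generated the table below is not shown in the rendered report):

library(naniar)

# Tabulate the count and percentage of missing values for every column,
# sorted from most to least missing
miss_var_summary(df)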
| variable | n_miss | pct_miss |
|---|---|---|
| MFR | 212 | 8.2458187 |
| Brand Code | 120 | 4.6674446 |
| Filler Speed | 57 | 2.2170362 |
| PC Volume | 39 | 1.5169195 |
| PSC CO2 | 39 | 1.5169195 |
| Fill Ounces | 38 | 1.4780241 |
| PSC | 33 | 1.2835473 |
| Carb Pressure1 | 32 | 1.2446519 |
| Hyd Pressure4 | 30 | 1.1668611 |
| Carb Pressure | 27 | 1.0501750 |
| Carb Temp | 26 | 1.0112797 |
| PSC Fill | 23 | 0.8945935 |
| Fill Pressure | 22 | 0.8556982 |
| Filler Level | 20 | 0.7779074 |
| Hyd Pressure2 | 15 | 0.5834306 |
| Hyd Pressure3 | 15 | 0.5834306 |
| Temperature | 14 | 0.5445352 |
| Oxygen Filler | 12 | 0.4667445 |
| Pressure Setpoint | 12 | 0.4667445 |
| Hyd Pressure1 | 11 | 0.4278491 |
| Carb Volume | 10 | 0.3889537 |
| Carb Rel | 10 | 0.3889537 |
| Alch Rel | 9 | 0.3500583 |
| Usage cont | 5 | 0.1944769 |
| PH | 4 | 0.1555815 |
| Mnf Flow | 2 | 0.0777907 |
| Carb Flow | 2 | 0.0777907 |
| Bowl Setpoint | 2 | 0.0777907 |
| Density | 1 | 0.0388954 |
| Balling | 1 | 0.0388954 |
| Balling Lvl | 1 | 0.0388954 |
| Pressure Vacuum | 0 | 0.0000000 |
| Air Pressurer | 0 | 0.0000000 |
Notice that approximately 8.25 percent of the rows are missing a value for MFR. We may need to drop this feature, since as missingness increases, so do the potential negative consequences of imputation. Additionally, the categorical feature Brand Code is missing approximately 4.67 percent of its values. Since we do not know whether these represent another brand or are truly missing, we will recode the missing values as a new category, ‘Unknown’. The rest of the features are missing only small percentages of values, suggesting that KNN imputation should be safe.
Next, we visualize the distributions of each of the predictor features. The visuals will help us select features for modeling, assess relationships between features and with PH, and identify outliers as well as transformations that might improve model resolution.
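A sketch of how the distribution plots can be generated, using the same faceted-histogram pattern we reuse after the transformations later in the report (the omitted chunk presumably looked similar):

# Reshape the numeric predictors to long format for faceting
gather_df <- df %>%
  dplyr::select(-c(PH, `Brand Code`)) %>%
  gather(key = 'variable', value = 'value')

# Histogram with a density overlay for each predictor
ggplot(gather_df) +
  geom_histogram(aes(x=value, y = ..density..), bins=30) +
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)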
The distribution profiles show that skewness is prevalent: right skew in Oxygen Filler, PSC, and Temperature, and left skew in Filler Speed and MFR. These deviations from a traditional normal distribution can be problematic for linear regression assumptions, so we may need to transform the data. Several features are discrete with limited possible values, e.g. Pressure Setpoint. Furthermore, we have a number of bimodal features, e.g. Air Pressurer, Balling, and Balling Lvl.
Bimodal features in a dataset are problematic but interesting, representing areas of potential opportunity and exploration. They suggest the existence of two different groups, or classes, within a given feature. These groups may have separate but overlapping distributions that could carry strong predictive signal in a model.
Were we tackling in-depth feature engineering in this analysis, we could leverage the mixtools package (see its R vignette). This package fits mixture models in which the data can be subdivided into subgroups. We could then add new binary features indicating, for each instance, the distribution to which it belongs.
Here is a quick example showing a possible mix within Air Pressurer:
# Select `Air Pressurer` column and remove any missing data
df_mix <- df %>%
dplyr::select(`Air Pressurer`) %>%
tidyr::drop_na()
# Calculate mixed distributions for `Air Pressurer`
air_pressure_mix <- normalmixEM(df_mix$`Air Pressurer`,
lambda = .5,
mu = c(140, 148),
sigma = 1,
maxit=60)

## number of iterations= 7
# Simple plot to illustrate possible bimodal mix of groups
plot(air_pressure_mix,
whichplots = 2,
density = TRUE,
main2 = "`Air Pressurer` Possible Distributions",
xlab2 = "Air Pressurer")Lastly, several features have relatively normal distributions along with high numbers of values at an extreme. We have no information on whether these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will need to review each associated feature to determine whether to impute the values, leave them as is, or apply feature engineering.
We also elected to use boxplots to understand the spread of each feature.
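A sketch of the boxplot chunk, assuming the same long-format reshaping as above:

# Boxplots of each numeric feature to inspect spread and outliers
df %>%
  dplyr::select(-`Brand Code`) %>%
  gather(key = 'variable', value = 'value') %>%
  ggplot(aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = 'red', outlier.size = 0.5) +
  facet_wrap(. ~variable, scales = 'free', ncol = 4) +
  labs(x = NULL)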
The boxplots reveal outliers, though none of them seem egregious enough to warrant imputing or removal. Outliers should only be imputed or dropped if we have reason to believe they are errant or contain no critical information.
Next, we generate scatter plots of each predictor versus the target to get an idea of the relationship between them.
The scatter plots indicate some clear relationships between our target and predictor features, such as PH and Oxygen Filler or PH and Alch Rel. However, we also see clear correlations between some of the predictors, like Carb Temp and Carb Pressure. Overall, although our plots indicate some interesting relationships, they also underline the aforementioned possible issues with the data. For instance, many predictors have skewed distributions, and in some cases, missing data may be recorded as ‘0’.
We next quantify the relationships visualized above. In general, our model should focus on features showing stronger positive or negative correlations with PH. Features with correlations closer to zero will probably not provide any meaningful information on pH levels.
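A sketch of the correlation calculation behind the table below, assuming pairwise-complete observations:

# Correlation of each numeric feature with PH, ignoring missing values pairwise
num_df <- df %>% dplyr::select(-`Brand Code`)
ph_cors <- cor(num_df, use = 'pairwise.complete.obs')[, 'PH']

# Drop PH itself and sort from strongest positive to strongest negative correlation
ph_cors <- sort(ph_cors[names(ph_cors) != 'PH'], decreasing = TRUE)
data.frame(values = ph_cors, ind = names(ph_cors))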
## values ind
## 1 0.361587534 Bowl Setpoint
## 2 0.352043962 Filler Level
## 3 0.233593699 Carb Flow
## 4 0.219735497 Pressure Vacuum
## 5 0.196051481 Carb Rel
## 6 0.166682228 Alch Rel
## 7 0.164485364 Oxygen Filler
## 8 0.109371168 Balling Lvl
## 9 0.098866734 PC Volume
## 10 0.095546936 Density
## 11 0.076700227 Balling
## 12 0.076213407 Carb Pressure
## 13 0.072132509 Carb Volume
## 14 0.032279368 Carb Temp
## 15 -0.007997231 Air Pressurer
## 16 -0.023809796 PSC Fill
## 17 -0.040882953 Filler Speed
## 18 -0.045196477 MFR
## 19 -0.047066423 Hyd Pressure1
## 20 -0.069873041 PSC
## 21 -0.085259857 PSC CO2
## 22 -0.118335902 Fill Ounces
## 23 -0.118764185 Carb Pressure1
## 24 -0.171434026 Hyd Pressure4
## 25 -0.182659650 Temperature
## 26 -0.222660048 Hyd Pressure2
## 27 -0.268101792 Hyd Pressure3
## 28 -0.311663908 Pressure Setpoint
## 29 -0.316514463 Fill Pressure
## 30 -0.357611993 Usage cont
## 31 -0.459231253 Mnf Flow
It appears that Bowl Setpoint, Filler Level, Carb Flow, Pressure Vacuum, and Carb Rel have the strongest positive correlations with PH, while Mnf Flow, Usage cont, Fill Pressure, Pressure Setpoint, and Hyd Pressure3 have the strongest negative correlations. The remaining features have weak correlations, which implies they carry less predictive power.
One problem that can occur with multiple regression is a correlation between predictive features, or multicollinearity. A quick check is to run correlations between all predictors.
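A sketch of the predictor correlation matrix, using the corrplot package (an assumption; the chunk and figure are not shown here):

library(corrplot)

# Pairwise correlations between the numeric predictors
pred_cors <- df %>%
  dplyr::select(-c(PH, `Brand Code`)) %>%
  cor(use = 'pairwise.complete.obs')

# Visualize the upper triangle of the correlation matrix
corrplot(pred_cors, method = 'color', type = 'upper', tl.cex = 0.6)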
We can see that some variables are highly correlated with one another; for example, Balling Lvl correlates with Carb Volume, Carb Rel, Alch Rel, Density, and Balling at between 0.75 and 1. When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.
As a note, this dataset is challenging as many of the predictive features go hand-in-hand with other features and multicollinearity will be a problem.
Lastly, we want to check for any features that show near-zero variance. Features that take the same value across most of the instances will add little predictive information.
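A sketch of this check using caret’s nearZeroVar:

# Flag features with near-zero variance using caret
nzv <- nearZeroVar(df, saveMetrics = TRUE)

# Show only the flagged features
nzv[nzv$nzv, ]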
## freqRatio percentUnique zeroVar nzv
## Hyd Pressure1 31.11111 9.529366 FALSE TRUE
Hyd Pressure1 displays near-zero variance. We will drop this feature prior to modeling.
To summarize our data preparation and exploration, we distinguish our findings into a few categories below.
- MFR has more than 8% missing values - remove this feature.
- Hyd Pressure1 shows little variance - remove this feature.
- 4 rows are missing PH and need to be removed.
- Replace missing values of Brand Code with “Unknown”.
- Impute remaining missing values in the training data using kNN() from the VIM package.
- Impute remaining missing values in the evaluation data using kNN() from the VIM package.

set.seed(181)
# drop rows with missing PH
df_clean <- df_clean %>%
filter(!is.na(PH))
# Change Brand Code missing to 'Unknown' in our training data
brand_code <- df_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_clean$`Brand Code` <- brand_code$`Brand Code`
# Change Brand Code missing to 'Unknown' in our evaluation data
brand_code <- df_eval_clean %>%
dplyr::select(`Brand Code`) %>%
replace_na(list(`Brand Code` = 'Unknown'))
df_eval_clean$`Brand Code` <- brand_code$`Brand Code`
# There is an edge case where our Eval data might have a `Brand Code` not seen in our training set.
# If so, let's convert them to 'Unknown'. This is appropriate since any model trained without the
# new value wouldn't be able to glean any info from it.
codes <- unique(df_clean$`Brand Code`)
df_eval_clean <- df_eval_clean %>%
mutate(`Brand Code` = if_else(`Brand Code` %in% codes, `Brand Code`, 'Unknown'))
# Use the kNN imputing method from VIM package to impute missing values in our training data
df_clean <- df_clean %>%
kNN(k=10) %>%
dplyr::select(colnames(df_clean))
# Use the kNN imputing method from VIM package to impute missing values in our evaluation data
df_eval_clean <- df_eval_clean %>%
kNN(k=10) %>%
dplyr::select(colnames(df_eval_clean))

We do not drop any outliers given all values seem reasonable.
Brand Code is a categorical variable with values A, B, C, D and Unknown. We convert it to a set of dummy columns for modeling.
# -----
# Training data - Convert our `Brand Code` column into a set of dummy variables
df_clean_dummy <- dummyVars(PH ~ `Brand Code`, data = df_clean)
dummies <- predict(df_clean_dummy, df_clean)
# Get the dummy column names
dummy_cols <- sort(colnames(dummies))
# Make sure the new dummy columns are sorted in alpha order (to make sure our columns will match the eval dataset)
dummies <- as.tibble(dummies) %>%
dplyr::select(dummy_cols)
# remove the original categorical feature
df_clean <- df_clean %>%
dplyr::select(-`Brand Code`)
# add the new dummy columns to our main training dataframe
df_clean <- cbind(dummies, df_clean)
# -----
# Evaluation data - Convert our `Brand Code` column into a set of dummy variables
#df_eval_clean <- dummyVars(PH ~ `Brand Code`, data = df_eval_clean)
df_eval_clean$PH <- 1
eval_dummies <- predict(df_clean_dummy, df_eval_clean)
# Edge Case - if the eval dataset doesn't have a specific `Brand Code`
# we will be missing the necessary dummy column. Let's check and if necessary add
# appropriate dummy columns with all 0's.
for (c in dummy_cols) {
if (!(c %in% colnames(eval_dummies))) {
eval_dummies[c] <- 0
}
}
# Now sort the eval_dummy columns so they match the training set dummies
eval_dummy_cols <- sort(colnames(eval_dummies))
eval_dummies <- as.tibble(eval_dummies) %>%
dplyr::select(eval_dummy_cols)
# remove the original categorical feature
df_eval_clean <- df_eval_clean %>%
dplyr::select(-c(`Brand Code`, PH))
# add the new dummy columns to our main eval dataframe
df_eval_clean <- cbind(eval_dummies, df_eval_clean)

Finally, as noted in our data exploration and histogram plots, some of our features are highly skewed. To address this skewness, we scale, center, and apply the Box-Cox transformation to the skewed features using preProcess from caret. These transformations should produce distributions that better approximate normal and thus facilitate modeling.
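The transformation step looks roughly like the following sketch, assuming preProcess is fit on the predictor columns (with PH carried through unchanged) and applied to both the training and evaluation data; printing the fitted object yields the summary below:

# Fit Box-Cox, centering, and scaling on the predictors (PH excluded)
preproc <- preProcess(dplyr::select(df_clean, -PH),
                      method = c("BoxCox", "center", "scale"))

# Apply the transformations and re-attach the untouched target
df_transformed <- predict(preproc, dplyr::select(df_clean, -PH))
df_transformed$PH <- df_clean$PH

# Apply the same transformations to the evaluation data
df_eval_transformed <- predict(preproc, df_eval_clean)

preproc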
## Created from 2567 samples and 34 variables
##
## Pre-processing:
## - Box-Cox transformation (22)
## - centered (34)
## - ignored (0)
## - scaled (34)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.00000 -1.90000 -0.15000 -0.08182 1.15000 2.00000
Here are some plots to demonstrate the changes in distributions after the transformations:
# Prepare data for ggplot
gather_df <- df_transformed %>%
dplyr::select(-c(PH)) %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)

As expected, the dummy variables, e.g. `Brand Code`A, appear binary. We still have bimodal features, since we did not apply any feature engineering to address them. A few features, including PSC Fill and Temperature, still show skew, but they are closer to normal. Our transformations are complete, and we can continue on to building our models.
With a solid understanding of our dataset, and with our data cleaned, we can now start to build candidate models. First, we split our cleaned dataset into training and testing sets (80% training, 20% testing). This split is necessary because the provided evaluation dataset does not include PH values, meaning we cannot measure model performance against it.
set.seed(181)
# utilizing one dataset for all four models
training_set <- createDataPartition(df_transformed$PH, p=0.8, list=FALSE)
df_train <- df_transformed[training_set,]
df_test <- df_transformed[-training_set,]

Using our training dataset, we build a multiple linear regression model that initially regresses PH on all of the features not removed during data preparation. We then use a stepwise process to home in on only the most significant features.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# Model 1 - Build Multi-Linear Regression
model1_raw <- lm(PH ~ ., df_train)
# Build model 1 - only significant features (using stepAIC)
model1 <- stepAIC(model1_raw, direction = "both",
scope = list(upper = model1_raw, lower = ~ 1),
scale = 0, trace = FALSE)
stopCluster(cl)
# Display Model 1 Summary
(lmf_s <- summary(model1))

##
## Call:
## lm(formula = PH ~ `\`Brand Code\`A` + `\`Brand Code\`B` + `\`Brand Code\`C` +
## `Fill Ounces` + `PC Volume` + PSC + `PSC CO2` + `Mnf Flow` +
## `Carb Pressure1` + `Fill Pressure` + `Hyd Pressure2` + `Hyd Pressure3` +
## `Filler Level` + Temperature + `Usage cont` + `Carb Flow` +
## Density + Balling + `Pressure Vacuum` + `Oxygen Filler` +
## `Bowl Setpoint` + `Pressure Setpoint` + `Alch Rel` + `Balling Lvl`,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50945 -0.07982 0.00982 0.08504 0.43065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.547150 0.002915 2932.148 < 2e-16 ***
## `\\`Brand Code\\`A` -0.016743 0.004227 -3.961 7.73e-05 ***
## `\\`Brand Code\\`B` 0.029783 0.006849 4.348 1.44e-05 ***
## `\\`Brand Code\\`C` -0.024726 0.004844 -5.104 3.63e-07 ***
## `Fill Ounces` -0.005894 0.003064 -1.924 0.054533 .
## `PC Volume` -0.008119 0.003577 -2.270 0.023313 *
## PSC -0.006766 0.003069 -2.205 0.027584 *
## `PSC CO2` -0.006481 0.002972 -2.180 0.029353 *
## `Mnf Flow` -0.086780 0.006344 -13.679 < 2e-16 ***
## `Carb Pressure1` 0.030781 0.003577 8.605 < 2e-16 ***
## `Fill Pressure` 0.011077 0.004300 2.576 0.010066 *
## `Hyd Pressure2` -0.023931 0.008542 -2.802 0.005131 **
## `Hyd Pressure3` 0.062535 0.010064 6.214 6.25e-10 ***
## `Filler Level` -0.016405 0.009345 -1.755 0.079332 .
## Temperature -0.017323 0.003320 -5.217 2.00e-07 ***
## `Usage cont` -0.019289 0.003768 -5.120 3.35e-07 ***
## `Carb Flow` 0.009855 0.003864 2.551 0.010824 *
## Density -0.020765 0.009086 -2.285 0.022401 *
## Balling -0.088497 0.014300 -6.189 7.32e-10 ***
## `Pressure Vacuum` -0.014525 0.004279 -3.394 0.000702 ***
## `Oxygen Filler` -0.011546 0.004079 -2.830 0.004694 **
## `Bowl Setpoint` 0.053075 0.009502 5.586 2.64e-08 ***
## `Pressure Setpoint` -0.017965 0.004387 -4.095 4.38e-05 ***
## `Alch Rel` 0.027232 0.011035 2.468 0.013678 *
## `Balling Lvl` 0.107707 0.016128 6.678 3.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1319 on 2030 degrees of freedom
## Multiple R-squared: 0.4153, Adjusted R-squared: 0.4084
## F-statistic: 60.08 on 24 and 2030 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 8.541432972 8.5528662976
## `\\`Brand Code\\`A` -0.025032457 -0.0084526361
## `\\`Brand Code\\`B` 0.016350785 0.0432148570
## `\\`Brand Code\\`C` -0.034225499 -0.0152255359
## `Fill Ounces` -0.011902406 0.0001147270
## `PC Volume` -0.015132935 -0.0011046087
## PSC -0.012783804 -0.0007475681
## `PSC CO2` -0.012309755 -0.0006513135
## `Mnf Flow` -0.099222383 -0.0743385016
## `Carb Pressure1` 0.023765704 0.0377968490
## `Fill Pressure` 0.002643829 0.0195097510
## `Hyd Pressure2` -0.040682619 -0.0071802953
## `Hyd Pressure3` 0.042798604 0.0822704157
## `Filler Level` -0.034731970 0.0019220412
## Temperature -0.023834539 -0.0108108559
## `Usage cont` -0.026677337 -0.0119001011
## `Carb Flow` 0.002277800 0.0174320499
## Density -0.038584542 -0.0029450834
## Balling -0.116541117 -0.0604526406
## `Pressure Vacuum` -0.022916789 -0.0061323473
## `Oxygen Filler` -0.019546262 -0.0035462859
## `Bowl Setpoint` 0.034440071 0.0717092181
## `Pressure Setpoint` -0.026567991 -0.0093622248
## `Alch Rel` 0.005590770 0.0488734326
## `Balling Lvl` 0.076077342 0.1393359230
# Display variable importance plot
variableImportancePlot(model1, "Model 1 LM Variable Importance")

## [1] "VIF scores of predictors"
## `\\`Brand Code\\`A` `\\`Brand Code\\`B` `\\`Brand Code\\`C` `Fill Ounces`
## 2.112206 5.538897 2.716281 1.101626
## `PC Volume` PSC `PSC CO2` `Mnf Flow`
## 1.479643 1.103823 1.030926 4.760599
## `Carb Pressure1` `Fill Pressure` `Hyd Pressure2` `Hyd Pressure3`
## 1.525962 2.187856 8.628236 12.049196
## `Filler Level` Temperature `Usage cont` `Carb Flow`
## 10.325127 1.340368 1.690024 1.803029
## Density Balling `Pressure Vacuum` `Oxygen Filler`
## 9.766020 24.149939 2.122035 1.981706
## `Bowl Setpoint` `Pressure Setpoint` `Alch Rel` `Balling Lvl`
## 10.724204 2.281632 14.349801 30.617008
Applying Model 1 against our Test Data:
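The chunk that scores Model 1 on the test set is not shown in the rendered output; it follows the same pattern as the later models and initializes the shared results table (a sketch):

# Predict df_test and calculate performance
lm_pred <- predict(model1, newdata = df_test)
results <- data.frame(t(postResample(pred = lm_pred, obs = df_test$PH))) %>%
  mutate(Model = "Multiple Regression")
(results)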
We employ ridge regression for our second model. Ridge regression uses shrinkage to penalize coefficient estimates in an effort to control the inflation that multicollinearity can cause. Given the noted presence of high correlations across features in this dataset, ridge seems like an appropriate modeling approach.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# to find the right lambda using cv.glmnet
x_train <- model.matrix(PH ~ ., data = df_train)
x_test <- model.matrix(PH ~ ., data = df_test)
cv.glmnet <- cv.glmnet(x_train, df_train$PH, alpha = 0)
ridge_model <- glmnet(x_train, df_train$PH, alpha = 0, lambda = cv.glmnet$lambda.min)
stopCluster(cl)
summary(ridge_model)

## Length Class Mode
## a0 1 -none- numeric
## beta 35 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
Applying Model 2 against our Test Data:
# Predict df_test and calculate performance
pred_ridge <- predict(ridge_model, x_test)
results <- data.frame(t(postResample(pred = pred_ridge, obs = df_test$PH))) %>%
mutate(Model = "Ridge Regression") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1319699 0.4491237 0.1033111 Ridge Regression
## 2 0.1318883 0.4439717 0.1028018 Mutiple Regression
Next, we build an elastic net model. Elastic net combines two types of penalties: the squared-magnitude shrinkage penalty used by ridge regression and the absolute-value penalty used by the lasso (least absolute shrinkage and selection operator). Here again, elastic net could help address some of the shortcomings of the dataset.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# training the elastic net regression model using train
elas_net_model <- train(
PH ~ ., data = df_train, method = "glmnet",
trControl = trainControl("repeatedcv", repeats = 8),
tuneLength = 4
)
stopCluster(cl)
summary(elas_net_model)

## Length Class Mode
## a0 97 -none- numeric
## beta 3298 dgCMatrix S4
## df 97 -none- numeric
## dim 2 -none- numeric
## lambda 97 -none- numeric
## dev.ratio 97 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
Applying Model 3 against our Test Data:
# Make predictions
elas_net_pred <- predict(elas_net_model, df_test)
# Model performance metrics
results <- data.frame(t(postResample(pred=elas_net_pred, obs=df_test$PH))) %>%
mutate(Model = "ElasticNet Regression") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1313219 0.4500410 0.1027585 ElasticNet Regression
## 2 0.1319699 0.4491237 0.1033111 Ridge Regression
## 3 0.1318883 0.4439717 0.1028018 Mutiple Regression
Our fourth model averages results from several neural network models. Generally, a neural network models an outcome, e.g. PH, using a set of unobserved (hidden) units. Specifically, we use the avNNet method from caret, which aggregates a group of neural networks by averaging their predictions; averaging can have a strong positive effect on model performance.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
# using method avNNET from train, which is to aggregate several neural network models by averaging
nnetGrid <- expand.grid(.decay = c(0.1, 0.5), .size = c(1,10), .bag = FALSE)
nnet.model <- train(PH ~ ., data = df_train, method = "avNNet",
                    preProcess = c("center", "scale"),
                    tuneGrid = nnetGrid,
                    trControl = trainControl(method = "repeatedcv", repeats = 1),
                    trace = FALSE, linout = TRUE, maxit = 500)
stopCluster(cl)
summary(nnet.model)

## Length Class Mode
## model 5 -none- list
## repeats 1 -none- numeric
## bag 1 -none- logical
## seeds 5 -none- numeric
## names 34 -none- character
## terms 3 terms call
## coefnames 34 -none- character
## xlevels 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 3 data.frame list
## obsLevels 1 -none- logical
## param 3 -none- list
Applying Model 4 against our Test Data:
# Predict df_test and calculate performance
nnet_pred <- predict(nnet.model, newdata = df_test)
results <- data.frame(t(postResample(pred = nnet_pred, obs = df_test$PH))) %>%
mutate(Model = "Neural Network (avNNET - Modeling Averaging)") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 2 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 3 0.1319699 0.4491237 0.10331111 Ridge Regression
## 4 0.1318883 0.4439717 0.10280182 Mutiple Regression
Our fifth model uses a k-nearest neighbors (KNN) approach to predict PH. Each instance’s predicted value is the average of the K nearest points in the training data. KNN is an intuitive approach that can predict well when instances with similar predictor values tend to have similar responses.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
knnModel <- train(PH ~ ., data = df_train, method = "knn", preProc = c("center","scale"), tuneLength = 10)
stopCluster(cl)
summary(knnModel)

## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 34 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
Applying Model 5 against our Test Data:
# Predict df_test and calculate performance
knnPred <- predict(knnModel, newdata = df_test)
results <- data.frame(t(postResample(pred = knnPred, obs = df_test$PH))) %>%
mutate(Model = "k-Nearest Neighbors(kNN)") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1209728 0.5364383 0.09312178 k-Nearest Neighbors(kNN)
## 2 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 3 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 4 0.1319699 0.4491237 0.10331111 Ridge Regression
## 5 0.1318883 0.4439717 0.10280182 Mutiple Regression
Generalized boosted models (GBMs) encompass both regression and classification. A GBM combines a loss function with a weak learner to build an additive model that minimizes that loss function.
df_transformed1 <- df_transformed %>% dplyr::select (-PH)
X.train <- df_transformed1[training_set, ]
y.train <- df_transformed$PH[training_set]
X.test <- df_transformed1[-training_set, ]
y.test <- df_transformed$PH[-training_set]
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
grid <- expand.grid(n.trees = c(50, 100, 150, 200),
interaction.depth = c(1, 5, 10, 15),
shrinkage = c(0.01, 0.1, 0.5),
n.minobsinnode = c(5, 10, 15))
gbm_Model <- train(x = X.train,
y = y.train,
method = "gbm",
tuneGrid = grid,
verbose = FALSE # turn off the status of training process
)
# df_train_data <- as.matrix(df_train[-35])
# df_train_label <- df_train[[35]]
# dtrain <- xgb.DMatrix(data = df_train_data, label = df_train_label)
# ###
# df_test_data <- as.matrix(df_test[-35])
# df_test_label <- df_test[[35]]
# dtest <- xgb.DMatrix(data = df_test_data, label = df_test_label)
# ###
# cl <- makePSOCKcluster(5)
# registerDoParallel(cl)
# parametersGrid <- expand.grid(eta = 0.2,
# colsample_bytree = 1,
# max_depth = c(11, 12, 13, 14),
# subsample = 1,
# gamma = 0,
# min_child_weight = 1,
# nrounds = 25
# )
# controls <- trainControl(method = "cv", number = 10)
# xgb1 <- train(PH ~ .,
# data = df_train,
# method = "xgbTree",
# objective = "reg:squarederror",
# trControl = controls,
# tuneGrid = parametersGrid
# )
# xgb1
#
stopCluster(cl)
summary(gbm_Model)

## var rel.inf
## Mnf Flow Mnf Flow 17.1233242
## Oxygen Filler Oxygen Filler 5.9016356
## `Brand Code`C `Brand Code`C 5.3820653
## Pressure Vacuum Pressure Vacuum 5.0475294
## Usage cont Usage cont 4.8384888
## Filler Speed Filler Speed 4.5838305
## Alch Rel Alch Rel 4.4272881
## Carb Pressure1 Carb Pressure1 4.2104804
## Temperature Temperature 3.7351105
## Carb Rel Carb Rel 3.6977023
## PC Volume PC Volume 3.1142509
## Air Pressurer Air Pressurer 3.0409386
## Density Density 2.8197077
## Carb Flow Carb Flow 2.7918477
## Balling Lvl Balling Lvl 2.7413468
## Balling Balling 2.3065931
## Hyd Pressure2 Hyd Pressure2 2.2308897
## Fill Ounces Fill Ounces 2.1079970
## Hyd Pressure3 Hyd Pressure3 2.0923701
## Filler Level Filler Level 2.0279712
## Bowl Setpoint Bowl Setpoint 1.9691094
## Fill Pressure Fill Pressure 1.9040087
## PSC PSC 1.8429000
## PSC Fill PSC Fill 1.6105568
## Hyd Pressure4 Hyd Pressure4 1.5655389
## Carb Pressure Carb Pressure 1.5375453
## Carb Volume Carb Volume 1.3666573
## Carb Temp Carb Temp 1.0402123
## PSC CO2 PSC CO2 0.9311092
## `Brand Code`A `Brand Code`A 0.9237373
## `Brand Code`B `Brand Code`B 0.5347886
## Pressure Setpoint Pressure Setpoint 0.2776452
## `Brand Code`Unknown `Brand Code`Unknown 0.2748229
## `Brand Code`D `Brand Code`D 0.0000000
## n.trees interaction.depth shrinkage n.minobsinnode
## 92 200 15 0.1 10
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 34 predictors of which 33 had non-zero influence.
Applying Model 6 against our Test Data:
# Predict df_test and calculate performance
gbmPred <- predict(gbm_Model, newdata = df_test)
# y_pred <- predict(xgb, data.matrix(X.test[,-1]))
results <- data.frame(t(postResample(pred = gbmPred, obs = df_test$PH))) %>% mutate(Model = "Generalized Boosted Models") %>% rbind(results)
(results)

## RMSE Rsquared MAE Model
## 1 0.1008262 0.6745763 0.07589511 Generalized Boosted Models
## 2 0.1209728 0.5364383 0.09312178 k-Nearest Neighbors(kNN)
## 3 0.1069318 0.6330649 0.08075941 Neural Network (avNNET - Modeling Averaging)
## 4 0.1313219 0.4500410 0.10275847 ElasticNet Regression
## 5 0.1319699 0.4491237 0.10331111 Ridge Regression
## 6 0.1318883 0.4439717 0.10280182 Mutiple Regression
The approach used for our seventh model, multivariate adaptive regression splines (MARS), creates contrasting versions of each predictor to enter the model. These versions, piecewise features known as hinge functions, each cover an exclusive portion of a predictor’s range. Such features are created iteratively for all model predictors, and the process is followed by “pruning” of individual features that do not contribute to the model.
options(max.print = 1e+06)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
mars.model <- train(x = X.train, y = y.train, method = "earth", tuneGrid = mars.grid,
preProcess = c("center", "scale"), tuneLength = 10)## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
## Loading required package: TeachingDemos
## Call: earth(x=data.frame[2055,34], y=c(8.36,8.26,8.9...), keepxy=TRUE,
## degree=2, nprune=13)
##
## coefficients
## (Intercept) 8.3973280
## h(-0.202251-Mnf Flow) 0.1879260
## h(-0.323171-Carb Pressure1) -0.0421331
## h(2.01988-Temperature) 0.0327214
## h(Alch Rel-0.648866) 0.1088430
## \\Brand Code\\C * h(Hyd Pressure3-0.114264) -0.0348800
## \\Brand Code\\C * h(0.114264-Hyd Pressure3) -0.0538548
## h(Mnf Flow- -0.202251) * h(Usage cont- -0.255382) -0.0409204
## h(-0.202251-Mnf Flow) * h(Pressure Vacuum-0.380367) -0.0767888
## h(-0.202251-Mnf Flow) * h(0.380367-Pressure Vacuum) -0.0830172
## h(-0.202251-Mnf Flow) * h(Air Pressurer-0.819609) -0.0566259
## h(-0.571388-Balling) * h(0.648866-Alch Rel) 0.0792396
## h(Bowl Setpoint- -1.30829) * h(0.648866-Alch Rel) 0.0305173
##
## Selected 13 of 33 terms, and 11 of 34 predictors (nprune=13)
## Termination condition: RSq changed by less than 0.001 at 33 terms
## Importance: `Mnf Flow`, `\`Brand Code\`C`, `Hyd Pressure3`, `Alch Rel`, ...
## Number of terms at each degree of interaction: 1 4 8
## GCV 0.01626822 RSS 32.43017 GRSq 0.4473468 RSq 0.4633726
Applying Model 7 against our Test Data:
# Predict df_test and calculate performance
mars_pred <- predict(mars.model, newdata = X.test)
results <- data.frame(t(postResample(pred = mars_pred, obs = y.test))) %>% mutate(Model = "Multivariate Adaptive Regression Splines (MARS)") %>% rbind(results)
(results)

## RMSE Rsquared MAE
## 1 0.1280888 0.4765482 0.09951400
## 2 0.1008262 0.6745763 0.07589511
## 3 0.1209728 0.5364383 0.09312178
## 4 0.1069318 0.6330649 0.08075941
## 5 0.1313219 0.4500410 0.10275847
## 6 0.1319699 0.4491237 0.10331111
## 7 0.1318883 0.4439717 0.10280182
## Model
## 1 Multivariate Adaptive Regression Splines (MARS)
## 2 Generalized Boosted Models
## 3 k-Nearest Neighbors(kNN)
## 4 Neural Network (avNNET - Modeling Averaging)
## 5 ElasticNet Regression
## 6 Ridge Regression
## 7 Mutiple Regression
Our eighth and final candidate is a Cubist model. Only recently made open source, Cubist combines a variety of tree-based and rule-based techniques, including smoothing, rule-making, pruning, “boosting” through committees, and adjusting predictions using nearby training points.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(181)
cubist_Model <- train(x = X.train, y = y.train, method = "cubist")
stopCluster(cl)
# summary(cubist_Model)

Applying Model 8 against our Test Data:
# Predict df_test and calculate performance
Cubist.pred <- predict(cubist_Model, newdata = X.test)
results <- data.frame(t(postResample(pred = Cubist.pred, obs = y.test))) %>% mutate(Model = "Cubist Model") %>% rbind(results)
(results)

## RMSE Rsquared MAE
## 1 0.0937689 0.7193900 0.06879833
## 2 0.1280888 0.4765482 0.09951400
## 3 0.1008262 0.6745763 0.07589511
## 4 0.1209728 0.5364383 0.09312178
## 5 0.1069318 0.6330649 0.08075941
## 6 0.1313219 0.4500410 0.10275847
## 7 0.1319699 0.4491237 0.10331111
## 8 0.1318883 0.4439717 0.10280182
## Model
## 1 Cubist Model
## 2 Multivariate Adaptive Regression Splines (MARS)
## 3 Generalized Boosted Models
## 4 k-Nearest Neighbors(kNN)
## 5 Neural Network (avNNET - Modeling Averaging)
## 6 ElasticNet Regression
## 7 Ridge Regression
## 8 Mutiple Regression
We evaluate our eight models using three criteria: root mean squared error (RMSE), R-squared, and mean absolute error (MAE). The table below lists these criteria for each model.
## Model RMSE Rsquared
## 1 Cubist Model 0.0937689 0.7193900
## 2 Multivariate Adaptive Regression Splines (MARS) 0.1280888 0.4765482
## 3 Generalized Boosted Models 0.1008262 0.6745763
## 4 k-Nearest Neighbors(kNN) 0.1209728 0.5364383
## 5 Neural Network (avNNET - Modeling Averaging) 0.1069318 0.6330649
## 6 ElasticNet Regression 0.1313219 0.4500410
## 7 Ridge Regression 0.1319699 0.4491237
## 8 Mutiple Regression 0.1318883 0.4439717
## MAE
## 1 0.06879833
## 2 0.09951400
## 3 0.07589511
## 4 0.09312178
## 5 0.08075941
## 6 0.10275847
## 7 0.10331111
## 8 0.10280182
Based on RMSE and \(R^2\), both the Cubist and generalized boosted model (GBM) outperformed the other models. This is not surprising, as these models are more tolerant of multicollinearity and better capture non-linear relationships. While Cubist is known for running a little faster, we selected GBM as our final model. Cubist generates a complex rule structure that, while it can give higher accuracy, might not retain that accuracy when faced with new data that doesn’t conform to the original structure. We felt GBM would generalize better, and in a manufacturing setting with unknowns, it is the more conservative choice. GBM also lends itself to clearer feature importance, an important consideration since we want an explainable model.
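The variable importance ranking below comes from applying caret’s varImp to the fitted GBM (a sketch of the call):

# Rank features by their relative influence in the selected GBM
varImp(gbm_Model)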
## gbm variable importance
##
## only 20 most important variables shown (out of 34)
##
## Overall
## Mnf Flow 100.00
## Oxygen Filler 34.47
## `Brand Code`C 31.43
## Pressure Vacuum 29.48
## Usage cont 28.26
## Filler Speed 26.77
## Alch Rel 25.86
## Carb Pressure1 24.59
## Temperature 21.81
## Carb Rel 21.59
## PC Volume 18.19
## Air Pressurer 17.76
## Density 16.47
## Carb Flow 16.30
## Balling Lvl 16.01
## Balling 13.47
## Hyd Pressure2 13.03
## Fill Ounces 12.31
## Hyd Pressure3 12.22
## Filler Level 11.84
We apply Model #6 (GBM) to the holdout evaluation set to predict PH for these instances. We have saved these predictions in CSV format in the file eval_predictions.csv.
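A sketch of this final step, assuming the evaluation data received the same preprocessing as the training data (the df_eval_transformed frame from the earlier preprocessing sketch):

# Predict PH for the holdout evaluation set with the selected GBM
eval_pred <- predict(gbm_Model, newdata = df_eval_transformed)

# Attach the predictions and write them to an Excel-readable CSV
df_eval_out <- df_eval %>% mutate(PH = eval_pred)
write.csv(df_eval_out, 'eval_predictions.csv', row.names = FALSE)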
Source code: https://github.com/djlofland/DATA624_F2020_Group/tree/master/eval_predictions.csv