Group Members

  • Subhalaxmi Rout
  • Kenan Sooklall
  • Devin Teran
  • Christian Thieme
  • Leo Yi

Introduction

We have been given a dataset from a beverage manufacturing company consisting of 2,571 rows and 33 columns. The dataset contains information on different beverages and their chemical composition. The goal of this analysis is to use the 32 predictive features to predict the potential of hydrogen (pH), which is a measure of the acidity/alkalinity of the beverage. pH is the key performance indicator (KPI) in this analysis.

We’ll begin by reading in the dataset and looking at each column’s data type:

data_raw <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentData%20-%20TO%20MODEL.csv')
data_raw$obs_type <- "train"
eval_raw <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentEvaluation-%20TO%20PREDICT.csv')
eval_raw$obs_type <- "test"
combined <- data_raw %>%
  rbind(eval_raw)
# convert column names to lowercase and replace spaces with underscores
names(combined) <- str_replace_all(tolower(names(combined)), ' ', '_')
df <- combined %>%
  filter(obs_type == 'train') %>%
  select(-obs_type)
eval <- combined %>%
  filter(obs_type == 'test') %>%
  select(-obs_type)
glimpse(combined)
## Rows: 2,838
## Columns: 34
## $ brand_code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
## $ obs_type          <chr> "train", "train", "train", "train", "train", "train"…

We see that all columns, with the exception of brand_code, are doubles and continuous. Excluding the response variable, this means we have 1 categorical variable and 31 continuous variables to work with.

Exploratory Data Analysis

In the output above, we can see that there are missing values (NAs). Let’s see how pervasive this issue is within our dataset:

df %>%
  visdat::vis_miss(sort_miss = TRUE)

In total, only about 1% of our data is missing, and most columns are missing only a negligible amount. mfr and brand_code have the most missing values, at 8.25% and 4.67% of their rows, respectively. Additionally, there does not appear to be a pattern to which values are missing. Since missing values are not a pervasive issue, we'll continue with our analysis.
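
As a quick numeric check on the plot above, we can compute the percentage of missing values per column directly. This is a minimal sketch using dplyr and tidyr (assumed to be loaded as part of the tidyverse):

# percentage of missing values per column, sorted descending
df %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  tidyr::pivot_longer(everything(), names_to = "column", values_to = "pct_missing") %>%
  arrange(desc(pct_missing)) %>%
  head(5)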

Distribution of Response Variable: pH

Let’s get an understanding of the distribution of our response variable:

df %>%
  select(ph) %>%
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(fill = "lightsalmon2", color = "black") + 
  labs(title = "Histogram of pH", y = "Count") +  
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
 #   plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

The distribution of pH is left-skewed and multi-modal. A multi-modal distribution is often an indication that there are sub-populations within the data. We know from looking at our dataset that there is a brand_code variable with values A, B, C, and D. Let's break the distribution above into 4 distributions based on these values:

df %>%
  ggplot() + 
  aes(x = ph) + 
  geom_histogram(fill = "lightsalmon2", color = "black") +
  labs(title = "Distribution of pH by Brand") + 
  facet_wrap(~brand_code, scales = 'free_y') +  
  theme(
    plot.title = element_text(hjust = 0.45),
 #   plot.margin = ggplot2::margin(10, 20, 10, 10),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

Breaking the data down to this finer grain does not seem to eliminate much of the bi/multi-modal nature of these distributions. There may be even more granular sub-populations within this data that we are not seeing.

Now we’ll turn our attention to the numeric features within our dataset:

df %>%
  select(-ph) %>%
  inspectdf::inspect_num() %>%
  show_plot() +
  labs(title = 'Distribution of Numeric Columns in Training Data')

We note the following about these distributions:

  • air_pressurer - there appear to be either two distributions here, or a single distribution and a pocket of outliers
  • balling, balling_lvl, density, fill_pressure, hyd_pressure1, hyd_pressure2, hyd_pressure3, hyd_pressure4, mnf_flow, pressure_setpoint - there appear to be two distributions here. This could potentially be connected to brand_code or to something else not as easily distinguishable.
  • bowl_setpoint - half of all the values are around 120
  • carb_flow - most values fall between 3,000 and 4,000, with a large pocket of values around 1,000 as well
  • filler_speed, mfr, oxygen_filler - each appears to have either two distributions or a few significant outliers
  • general note: many of these distributions are skewed in one way or another, so a transformation may be helpful when generating predictions (a quick skewness check is sketched below)
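
To quantify the skew noted in the last bullet, one option is to compute the sample skewness of each numeric column. This is a minimal sketch assuming the e1071 package is available:

# sample skewness of each numeric predictor (NAs ignored);
# values far from 0 suggest a transformation may be worth trying
df %>%
  select(where(is.numeric)) %>%
  select(-ph) %>%
  purrr::map_dbl(~ e1071::skewness(.x, na.rm = TRUE)) %>%
  sort(decreasing = TRUE)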

Explanatory Variable Relationships with the Response Variable

Now that we've looked at our response variable, let's look at our explanatory variables. We'll begin with brand_code, which is our only categorical variable:

df %>%
  dplyr::select(brand_code, ph) %>%
  ggplot() + 
  aes(x = brand_code, y = ph) + 
  geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35)+
  labs(title = 'Brand Code Box Plots',
       x = element_blank(),
       y = 'pH') + 
  # facet_grid(~brand_code) +
#  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
   # panel.background = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  )

We can see from the above boxplots that brand_code does have a meaningful relationship with pH. We can also see some significant outliers in C and possibly D that will need to be evaluated further. We’ll now turn our attention to the numeric features in our dataset.

Numeric Features

df %>%
  select(-brand_code) %>% 
  gather(variable, value, -ph) %>%
  ggplot(aes(x = value, y = ph)) +
  geom_point(color = 'steelblue', alpha = 0.35) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "pH Relationship with Independent Variables", x = element_blank())+ 
  theme(
    plot.title = element_text(hjust = 0.45),
    panel.grid.major.y =  element_line(color = "grey", linetype = "dashed"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks.x = element_line(color = "grey")
  ) 

We note the following about the relationship between pH and these variables:

  • air_pressurer - there appear to be two sub-groups here. In checking whether this was due to brand_code, we found that these sub-groups exist even within individual brand levels
  • alch_rel - most points are gathered in 3 distinct areas; however, there appear to be 7 outliers
  • balling, balling_lvl - there appear to be two sub-groups here, which potentially look to be associated with brand_code
  • bowl_setpoint - appears to potentially be a categorical variable, as the values don't appear to be continuous. There also appear to be some outliers
  • carb_flow - there appear to be two or three groups of data points. We also note the presence of outliers
  • density - there appear to be two to three groups of points. We note the presence of an outlier
  • filler_speed - values mostly fall within the low or high range; values in the middle are less frequent
  • pressure_setpoint - values appear to be discrete, with the exception of 4 outliers
  • pressure_vacuum - values appear to be discrete, with a potentially positive linear relationship. We note the presence of an outlier
  • psc_co2 - values appear to be discrete. We note the presence of an outlier
  • psc_fill - there appear to be 5 bands that values fall into, with certain areas that have no values. We may consider adding a categorical variable to capture this (see the illustrative binning sketch after this list)
  • carb_pressure, carb_pressure1, carb_rel, carb_volume, fill_ounces, fill_pressure, filler_level, hyd_pressure1-4, mfr, oxygen_filler, pc_volume, psc, temperature, usage_cont - no visible relationship. We do note the presence of outliers
  • General note: many of these variables are on different scales; we'll take care of this during our data prep phase. Additionally, one outlier appears in most of these visuals and may need to be removed.
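
If we did want to capture the banded structure of psc_fill as a feature, one hypothetical approach is to bin the values with cut(). The break points below are assumptions chosen by eye from the scatter plot, not values taken from our analysis:

# illustrative only: bin psc_fill into bands (break points are assumptions)
df_bands <- df %>%
  mutate(psc_fill_band = cut(psc_fill,
                             breaks = c(-Inf, 0.1, 0.2, 0.3, 0.4, Inf),
                             labels = paste0("band_", 1:5)))
table(df_bands$psc_fill_band, useNA = "ifany")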

Correlated Features

For many models, correlation between features can be an issue. Let’s see what the correlation between our variables looks like:

numeric_values <- df %>% 
  dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>% 
  data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')

We see that many of our features are highly correlated. There are several methods we could use to address this; because we have many features, it may make sense to use principal component analysis, which would allow us to reduce the number of columns and hopefully produce a simpler model. Additionally, many tree-based models are not as sensitive to collinearity as linear models, so we'll focus our efforts on non-linear models in the modeling phase.
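
As a sketch of the two options mentioned above, caret's findCorrelation can flag redundant predictors and prcomp can produce principal components; the 0.90 cutoff below is an assumption:

# pairwise correlations among the predictors only (excluding the response)
pred_cor <- cor(numeric_values %>% select(-ph))
# predictors whose absolute correlation with another predictor exceeds 0.90
caret::findCorrelation(pred_cor, cutoff = 0.90, names = TRUE)
# principal components of the centered and scaled predictors
pca <- prcomp(numeric_values %>% select(-ph), center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component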

Summary EDA Notes

  • Feature distributions are skewed and may benefit from a transformation
  • Missing data will most likely not be a significant issue
  • There are several outliers in our features - we should think about using a modeling technique that is robust against outliers
  • Many of our features are significantly correlated with each other. PCA or another method may be helpful in reducing collinearity
  • There appear to be sub-populations even within brand_codes. It may be helpful to do some feature engineering to tease this information from the data

Data Processing

Outliers

In our analysis of the data, we noted several outliers. Most of them trace back to a single observation with a pH of 9.36. We tried removing this observation, but saw a marked reduction in accuracy, so we decided to keep it in the training set.
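
For reference, a minimal sketch of the removal we experimented with (the 9.3 threshold simply isolates that single extreme observation):

# inspect the extreme observation and build the filtered version we tested
df %>% filter(ph > 9.3)
df_no_outlier <- df %>% filter(ph <= 9.3 | is.na(ph))  # ultimately not used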

Feature Engineering

Many attempts were made at feature engineering, including combining features, rounding features, and expressing one feature as a ratio of another; however, any additional features seemed to decrease the performance of our model. It may be that each feature carries fairly strong signal on its own and that our engineering confounded that signal. As such, we have not included any additional features in our dataset.
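
As one concrete illustration of the kind of feature we tried, a ratio of two existing columns can be added and evaluated. The specific ratio below is hypothetical and, like the others, was not kept in the final dataset:

# hypothetical engineered feature: one column expressed relative to another
df_fe <- df %>%
  mutate(carb_pressure_to_temp = carb_pressure / carb_temp)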

Data Imputation

We’ll need to impute missing data for both numeric and categorical variables. We’ll use the knnImpute method for the numeric variables and a multinomial logistic regression model to impute the brand_code variable.

We do not use the combined dataset to train our models or impute values, under the assumption that this evaluation data wouldn't be available to us when training our models. In practice, data we are asked to score arrives after the models have been trained, so it cannot be used in the training process.

# remove rows without a response variable
df3 <- df %>%
  filter(!is.na(ph)) %>%
  mutate(brand_code = factor(brand_code))

df3$brand_code <- case_when(
  df3$brand_code == "A" ~ 1,
  df3$brand_code == "B" ~ 2,
  df3$brand_code == "C" ~ 3,
  df3$brand_code == "D" ~ 4,
)

# remove near zero variance variables
isNZV <- nearZeroVar(df3)
df3 <- df3[, -isNZV] # hyd_pressure1

# helper functions for imputation
# mode(): returns the most frequent value as a character string (the `...`
# absorbs the na.rm argument passed in by impute() below); used for brand_code
mode <- function(df, ...) {
  tb <- table(df)
  which(tb == max(tb)) %>% sort %>% `[`(., 1) %>% names
}

# impute(): replaces NAs in a vector with the value returned by `fun`
# (mean for the numeric columns, mode for the categorical column)
impute <- function(df, fun) {
  fval <- fun(df, na.rm = TRUE)
  y <- ifelse(is.na(df), fval, df)
  return(y)
}

df3 <- df3 %>% mutate(
    brand_code = impute(brand_code, mode),
    carb_volume     = impute(carb_volume, mean),
    fill_ounces = impute(fill_ounces, mean),
    pc_volume = impute(pc_volume, mean),
    carb_pressure = impute(carb_pressure, mean),
    carb_temp  = impute(carb_temp, mean),
    psc = impute(psc, mean),
    psc_fill = impute(psc_fill, mean),
    psc_co2 = impute(psc_co2, mean),
    carb_pressure1 = impute(carb_pressure1, mean),
    fill_pressure = impute(fill_pressure, mean),
    #hyd_pressure1 = impute(hyd_pressure1, mean),
    hyd_pressure2 = impute(hyd_pressure2, mean),
    hyd_pressure3 = impute(hyd_pressure3, mean),
    hyd_pressure4 = impute(hyd_pressure4, mean),
    filler_level = impute(filler_level, mean),
    filler_speed = impute(filler_speed, mean),
    temperature = impute(temperature, mean),
    usage_cont = impute(usage_cont, mean),
    carb_flow = impute(carb_flow, mean),
    mfr = impute(mfr, mean),
    oxygen_filler = impute(oxygen_filler, mean),
    bowl_setpoint = impute(bowl_setpoint, mean),
    pressure_setpoint = impute(pressure_setpoint, mean),
    alch_rel = impute(alch_rel, mean),
    carb_rel = impute(carb_rel, mean),
    balling_lvl = impute(balling_lvl, mean)
    ) %>% 
  mutate_if(is.character, as.numeric)

visdat::vis_miss(df3)

Processing Note

As part of our processing, we attempted many different transformations and adjustments but saw no improvement in our RMSE or MAE; in fact, many of the changes negatively affected our accuracy, so we chose a simpler processing approach. We experimented with Yeo-Johnson transformations, centering, scaling, excluding highly correlated features, and principal component analysis. Ultimately, we determined that using only KNN to impute missing values in our numeric columns and a multinomial model to impute missing values in our categorical column produced the most accurate results.
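
For completeness, here is a minimal sketch of the knnImpute and multinomial imputation approach described above, assuming the caret and nnet packages are available; the column handling is illustrative and differs slightly from the helper-function code shown earlier:

# numeric imputation with caret's knnImpute (note: this also centers and scales)
num_cols <- df %>% select(where(is.numeric)) %>% select(-ph)
knn_pp <- caret::preProcess(num_cols, method = "knnImpute")
num_imp <- predict(knn_pp, num_cols)
# categorical imputation: fit a multinomial model on rows where brand_code is
# known, using the imputed numeric predictors, then predict the missing codes
imp_df <- num_imp %>% mutate(brand_code = factor(df$brand_code))
known <- !is.na(imp_df$brand_code)
brand_fit <- nnet::multinom(brand_code ~ ., data = imp_df[known, ], trace = FALSE)
imp_df$brand_code[!known] <- predict(brand_fit, newdata = imp_df[!known, ])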

Train Test Split

After imputing, we'll use caret's createDataPartition to create a stratified 80/20 split (createDataPartition stratifies a numeric response by sampling within quantile groups of pH). We will use this test dataset to evaluate all our models; if two or more models have similar error and variance measures, we will continue with additional validation on new splits. During our EDA we noted that brand_code was a significant feature within our dataset; as such, we chose a stratified sampling approach in an effort to make our training and testing sets more uniform.

set.seed(101)
trainIndex <-  createDataPartition(df3$ph,
                                   p = 0.8,
                                   list = F)
train <- df3[trainIndex, ]
test <- df3[-trainIndex, ]

Modeling

Having split our data into training and testing sets, we're now ready to fit and train our models. Due to the non-linear relationships between our features and the response variable, as well as the outliers and other issues noted during our exploratory data analysis, we have determined that non-linear models would most likely perform best when creating predictions on this dataset. With this in mind, we'll fit and train several different non-linear models to see which one predicts with the lowest error and least variance on the test dataset. Once we determine which model is the most accurate, we'll move forward with creating predictions on the provided scoring set. We have chosen the following models to train:

  • Multivariate adaptive regression splines (MARS)
  • Support-vector machines (SVM)
  • K-Nearest neighbors (KNN)
  • Random Forest regression (RF)
  • Cubist regression
  • Neural Network (NN)

As part of our training, we use k-fold cross-validation to select the optimal parameters for each model.

# MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(101)
mars <- train(ph ~ .,
              data = train,
              method = 'earth',
              tuneGrid = marsGrid,
              trControl = trainControl(method = 'cv')
              )
# SVM
svm <- train(ph ~ .,
             data = train,
             method = 'svmRadial',
             preProc = c('center', 'scale'),
             tuneLength = 14,
             trControl = trainControl(method = 'cv')
             )
# KNN
set.seed(101)
knn <- train(ph ~ .,
             data = train,
             method = 'knn',
             preProc = c('center', 'scale'),
             tuneGrid = data.frame(.k = 1:20),
             trControl = trainControl(method = 'cv')
             )
# random forest
rf <- randomForest(ph ~ .,
                   data = train,
                   ntree = 1000)
# cubist
cubist <- train(ph ~ .,
                data = train,
                method = 'cubist')
  • Note: Due to the long training time of the neural network model, we have configured the code chunk below not to run; the results of the model run are recorded in the table in the Compare Models section.
# Avg Neural Nets
set.seed(100)
trainingData  <- select(train, -ph)
trainingData <- data.frame(trainingData)
testData  <- select(test, -ph)
testData <- data.frame(testData)
neural_model <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)
# get the maximum number of hidden units
maxSize <- max(neural_model$.size)
# there are M(p+1)+M+1 parameters in total
numWts <- 1*(maxSize * (length(trainingData) + 1) + maxSize + 1)
ctrl <- trainControl(method = "cv")
avNNet_model <- train(trainingData, train$ph,
                 method = "avNNet",
                 tuneGrid = neural_model,
                 trControl = ctrl,
                 preProcess = c("center", "scale"),
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 500,
                 MaxNWts = numWts
                 )
avNNet_predict <- predict(avNNet_model, newdata = testData)
avNNet_results <- postResample(pred = avNNet_predict, obs = test$ph)
avNNet_results$MAPE <- MAPE(test$ph, avNNet_predict)

Compare Models

Having trained each of our models, we'll now turn our attention to evaluating their predictive power. We'll use postResample to measure the error between each model's predictions and the holdout test data, and display the accuracy metrics in the table below:

# predictions
test$mars <- predict(mars, test)
test$svm <- predict(svm, test)
test$knn <- predict(knn, test)
test$rf <- predict(rf, test)
test$cubist <- predict(cubist, test)
# measure of error and variance
metrics <- data.frame(rbind(
  postResample(pred = test$mars, obs = test$ph),
  postResample(pred = test$svm, obs = test$ph),
  postResample(pred = test$knn, obs = test$ph),
  postResample(pred = test$rf, obs = test$ph),
  postResample(pred = test$cubist, obs = test$ph)
),
  row.names = c('MARS', 'SVM', 'KNN', 'Random Forest', 'Cubist')
)

metrics$MAPE <- c(MAPE(test$ph, test$mars), MAPE(test$ph, test$svm), 
                  MAPE(test$ph, test$knn), MAPE(test$ph, test$rf), MAPE(test$ph, test$cubist))
Model Type     RMSE        \(R^2\)     MAE          MAPE
MARS           0.1176799   0.4952452   0.0887886    99.9757625
SVM            0.1119914   0.5462167   0.0810064    99.8763735
KNN            0.1162803   0.5153739   0.0882178    99.8192806
RF             0.0946641   0.6905932   0.0694998    99.9888221
Cubist         0.0917986   0.692324    0.0670975    99.9619656
Neural Nets    0.1134418   0.5324197   0.08416577   99.98299

Cross Validation

The error measures for the random forest and Cubist models were close, so we'll do one more pass of cross validation with a new train/test split in order to have additional confidence in the chosen model.

# create new train/test split with different seed and 75/25 split
set.seed(2)
trainIndex <-  createDataPartition(df3$ph,
                                   p = 0.75,
                                   list = F)
train2 <- df3[trainIndex, ]
test2 <- df3[-trainIndex, ]
# random forest
rf2 <- randomForest(ph ~ .,
                    data = train2,
                    ntree = 1000)
# cubist
cubist2 <- train(ph ~ .,
                 data = train2,
                 method = 'cubist')
# predictions
test2$rf <- predict(rf2, test2)
test2$cubist <- predict(cubist2, test2)
# measure of error and variance
metrics2 <- data.frame(rbind(
  postResample(pred = test$cubist, obs = test$ph),
  postResample(pred = test2$rf, obs = test2$ph),
  postResample(pred = test2$cubist, obs = test2$ph)
),
  row.names = c('Cubist v1', 'Random Forest v2',  'Cubist v2')
)
metrics2$MAPE <- c(MAPE(test$ph, test$cubist), MAPE(test2$ph, test2$rf), 
                   MAPE(test2$ph, test2$cubist))
Model Type     RMSE        \(R^2\)     MAE         MAPE
Cubist v1      0.0917986   0.692324    0.0670975   99.9619656
RF v2          0.1021197   0.6520538   0.0733234   99.9617631
Cubist v2      0.0981522   0.662651    0.0699042   99.9608776

Comparing the second random forest fit with the second Cubist fit, the random forest shows relatively greater error when predicting on the new test set, so the Cubist is again the better model. When comparing the first and second Cubist models, the first outperforms the second.

Based on the model results above, we have chosen the Cubist regression model as our final model. The Cubist model has the lowest RMSE and MAE. The \(R^2\) values should only be used to compare models of the same type.

Variable Importance

Having selected our best performing model, we can now use caret's varImp method to look at which features within our dataset are most important to the fitted model.

plot(caret::varImp(cubist), main = "Cubist Regression Model Variable Importance")

It appears that mnf_flow, brand_code, balling, balling_lvl, and density are key to our model's performance. We saw in our correlation plot that mnf_flow had the strongest relationship with pH. Additionally, we saw in our box plots that brand_code had a significant relationship with pH as well.

balling, balling_lvl, and density are all within the top 7 most important variables in this model and are significantly correlated with one another. We attempted a model that excluded the collinear variables, but that resulted in lower accuracy. It's possible that the importance attributed to these three variables is distorted by their mutual correlation.

Predictions on the Evaluation Set

We'll use the Cubist regression model to predict on the provided evaluation set. Before making predictions, we'll need to perform the same data processing steps that we applied to our training dataset.

Imputing Evaluation Data

We'll need to impute missing data in the final evaluation data. To do this, we'll train another multinomial logistic regression model for brand_code, excluding pH. Finally, we'll use knnImpute again for the numeric variables. We use the entire combined dataset to impute these values, under the assumption that we have already trained our models on existing data and can leverage it to get a more accurate view of the true population distributions.

# get all data excluding ph
eval2 <- combined %>%
  select(-ph)
# convert brand code to numeric
eval2$brand_code <- case_when(
  eval2$brand_code == "A" ~ 1,
  eval2$brand_code == "B" ~ 2,
  eval2$brand_code == "C" ~ 3,
  eval2$brand_code == "D" ~ 4,
)


# isolate evaluation data only
eval2 <- eval2 %>%
  filter(obs_type == 'test') %>%
  select(-obs_type)

# # remove near zero variance variables
# isNZV_2 <- nearZeroVar(eval2)
# eval2 <- eval2[, -isNZV_2]

eval2 <- eval2 %>% mutate(
    brand_code = impute(brand_code, mode),
    carb_volume     = impute(carb_volume, mean),
    fill_ounces = impute(fill_ounces, mean),
    pc_volume = impute(pc_volume, mean),
    #carb_pressure = impute(carb_pressure, mean),
    carb_temp  = impute(carb_temp, mean),
    psc = impute(psc, mean),
    psc_fill = impute(psc_fill, mean),
    psc_co2 = impute(psc_co2, mean),
    carb_pressure1 = impute(carb_pressure1, mean),
    fill_pressure = impute(fill_pressure, mean),
    #hyd_pressure1 = impute(hyd_pressure1, mean),
    hyd_pressure2 = impute(hyd_pressure2, mean),
    hyd_pressure3 = impute(hyd_pressure3, mean),
    hyd_pressure4 = impute(hyd_pressure4, mean),
    filler_level = impute(filler_level, mean),
    filler_speed = impute(filler_speed, mean),
    temperature = impute(temperature, mean),
    usage_cont = impute(usage_cont, mean),
    density = impute(density, mean),
    #carb_flow = impute(carb_flow, mean),
    mfr = impute(mfr, mean),
    balling = impute(balling, mean),
    pressure_vacuum = impute(pressure_vacuum, mean),
    oxygen_filler = impute(oxygen_filler, mean),
    bowl_setpoint = impute(bowl_setpoint, mean),
    pressure_setpoint = impute(pressure_setpoint, mean),
    air_pressurer = impute(air_pressurer, mean),
    alch_rel = impute(alch_rel, mean),
    carb_rel = impute(carb_rel, mean)
    
    ) %>% 
  mutate_if(is.character, as.numeric)

#sum(is.na(eval2))

Having imputed the missing values in our evaluation set, we are now ready to make our final predictions.

Final Predictions

We make our final predictions by calling the predict method and passing in the eval2 dataframe, which is the evaluation set with all missing values imputed. Once the predictions have been made, we'll write them to a CSV file where they can be evaluated.

final_predictions <- predict(cubist, eval2)
summary(final_predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.148   8.465   8.529   8.553   8.667   8.841
head(final_predictions, n = 10)
##  [1] 8.489836 8.407301 8.495484 8.667449 8.500791 8.581749 8.457839 8.552064
##  [9] 8.524187 8.697361
#write.csv(final_predictions, "final_predictions.csv")
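
If a cleaner output file is desired, one option (the column name is our own choice) is to write the predictions with an explicit column name and without row numbers:

# write predictions with a named column and no row-number column
write.csv(data.frame(ph = final_predictions), "final_predictions.csv", row.names = FALSE)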

Conclusion and Next Steps

Throughout this project, we imported the data, performed extensive exploratory data analysis, processed our data, and finally trained models and made predictions. In the process, we determined that the Cubist regression model was the most accurate and used it to generate our final predictions. We experimented extensively with transformations, imputation methods, and feature engineering, but in the end the simplest dataset seemed to perform the best.

Given the unique nature of this dataset, we believe strong domain knowledge would be necessary to take this model to the next level. With detailed datasets containing many columns, it is hard to create new, meaningful features without strong domain knowledge. We believe feature engineering is the place to focus, and that additional modeling gains could be made by creating new features from this rich dataset.