Group Members

  • Subhalaxmi Rout
  • Kenan Sooklall
  • Devin Teran
  • Christian Thieme
  • Leo Yi

\pagebreak

Introduction

We have been given a dataset from a beverage manufacturing company consisting of 2,571 rows and 33 columns. The dataset contains information on different beverages and their chemical composition. The goal of this analysis is to use the 32 predictive features to predict pH (potential of hydrogen), a measure of the acidity/alkalinity of the beverage. pH is the key KPI in this analysis.

We’ll begin by reading in the dataset and looking at each column’s data type:

train <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentData%20-%20TO%20MODEL.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Brand Code` = col_character()
## )
## i Use `spec()` for the full column specifications.
train$obs_type <- "train"

test <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentEvaluation-%20TO%20PREDICT.csv')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Brand Code` = col_character(),
##   PH = col_logical()
## )
## i Use `spec()` for the full column specifications.
test$obs_type <- "test"

# stack train and test so transformations are applied consistently, then
# drop rows without a recorded PH (this removes 4 training rows and, since
# PH is empty in the evaluation file, the test rows as well)
df <- train %>%
  rbind(test) %>%
  filter(!is.na(PH))

# convert column names to all lowercase (tolower is vectorized, so we can
# assign its result to names() directly)
names(df) <- tolower(names(df))
glimpse(df)
## Rows: 2,567
## Columns: 34
## $ `brand code`        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B...
## $ `carb volume`       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.48666...
## $ `fill ounces`       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.3133...
## $ `pc volume`         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.1...
## $ `carb pressure`     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6,...
## $ `carb temp`         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8...
## $ psc                 <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0...
## $ `psc fill`          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34,...
## $ `psc co2`           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04,...
## $ `mnf flow`          <dbl> -100, -100, -100, -100, -100, -100, -100, -100,...
## $ `carb pressure1`    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2...
## $ `fill pressure`     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8,...
## $ `hyd pressure1`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `hyd pressure2`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `hyd pressure3`     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ `hyd pressure4`     <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 9...
## $ `filler level`      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4...
## $ `filler speed`      <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4...
## $ temperature         <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2,...
## $ `usage cont`        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74...
## $ `carb flow`         <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 29...
## $ density             <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84,...
## $ mfr                 <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, N...
## $ balling             <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298...
## $ `pressure vacuum`   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4,...
## $ ph                  <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38,...
## $ `oxygen filler`     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066...
## $ `bowl setpoint`     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 12...
## $ `pressure setpoint` <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0,...
## $ `air pressurer`     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2...
## $ `alch rel`          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52,...
## $ `carb rel`          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34,...
## $ `balling lvl`       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44,...
## $ obs_type            <chr> "train", "train", "train", "train", "train", "t...

We see that all columns, with the exception of brand code, are doubles and continuous. Excluding the response variable, this means that we have 1 categorical variable and 31 continuous variables to work with.

Exploratory Data Analysis

In the output above, we can see that there are missing values (NAs). Let’s see how pervasive this issue is within our dataset:

visdat::vis_miss(df %>% filter(obs_type == 'train'), sort_miss = TRUE)

In total, only about 1% of our data is missing. Most of the columns are missing a negligible amount of data. mfr and brand code have the largest amounts of missing values, at 8.25% and 4.67% respectively. Additionally, there does not appear to be a pattern to which values are missing. Since missing values are not a pervasive issue, we’ll continue with our analysis.
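
As a cross-check on the plot, we can compute each column’s missing percentage directly (a quick sketch, assuming the tidyverse and the df built above are in memory):

# percent of missing values per column in the training rows, highest first
df %>%
  filter(obs_type == 'train') %>%
  purrr::map_dbl(~ mean(is.na(.x)) * 100) %>%
  sort(decreasing = TRUE) %>%
  head()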

Distribution of Response Variable: pH

Let’s get an understanding of the distribution of our response variable:

df %>% 
  filter(obs_type == 'train') %>%
  select(ph) %>%
  ggplot() + 
  aes(x = ph) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of pH is left-skewed and multi-modal. A multi-modal distribution is often an indication that there are sub-populations within the data. We know from looking at our dataset that brand code takes the values A, B, C, and D. Let’s break the distribution above into 4 distributions based on these values:

df %>% 
  filter(obs_type == 'train') %>%
  select(`brand code`, ph) %>%
  ggplot() + 
  aes(x = ph) + 
  geom_histogram() +
  labs(title = "pH Distribution by Brand Code") + 
  facet_wrap(~`brand code`)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Breaking the data down to this finer grain does not appear to be much more helpful. There may be even more granular sub-populations within this data that we are not seeing.

Now we’ll turn our attention to the numeric features within our dataset:

inspectdf::inspect_num(df %>%  filter(obs_type == 'train') %>% select(-ph)) %>% 
  show_plot()

We note the following about these distributions:

  • air pressurer - there appear to be either two distributions here, or a single distribution with a pocket of outliers
  • balling, balling lvl, density, fill pressure, hyd pressure1, hyd pressure2, hyd pressure3, hyd pressure4, mnf flow, pressure setpoint - there appear to be two distributions here. This could potentially be connected to brand code or to something else not as easily distinguishable.
  • bowl setpoint - half of all the values are around 120
  • carb flow - most values fall between 3,000 and 4,000, with a large pocket of values at 1,000 as well
  • filler speed, mfr, oxygen filler - each appears to have either two distributions or a few significant outliers
  • general note: many of these distributions are skewed in one way or another, so a transformation may be helpful when generating predictions (quantified in the sketch below)
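
To quantify the skew noted above, we can compute the sample skewness of each numeric column (a sketch; it assumes the e1071 package, which this report does not otherwise load, is installed):

# sample skewness per numeric training column, most right-skewed first
df %>%
  filter(obs_type == 'train') %>%
  select_if(is.numeric) %>%
  purrr::map_dbl(~ e1071::skewness(.x, na.rm = TRUE)) %>%
  sort(decreasing = TRUE)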

Explanatory Variable Relationships with the Response Variable

Now that we’ve looked at our response variable, let’s look at our explanatory variables. We’ll begin first by looking at brand code, which is our only categorical variable:

df %>% 
  filter(obs_type == 'train') %>%
  select(`brand code`, ph) %>%
  ggplot() + 
  aes(y = ph) + 
  geom_boxplot() +
  labs(title = 'Brand Code Box Plots') + 
  facet_grid(~`brand code`, scales = 'free_x')

We can see from the above boxplots that brand code does have a meaningful relationship with pH. We can also see some significant outliers in C, and possibly D, that will need to be evaluated further. We’ll now turn our attention to the numeric features in our dataset.
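
To put numbers behind the boxplots, here is a quick per-brand summary of the response (a sketch over the training rows only):

# per-brand sample size, median pH, and spread
df %>%
  filter(obs_type == 'train') %>%
  group_by(`brand code`) %>%
  summarise(n = n(),
            median_ph = median(ph, na.rm = TRUE),
            sd_ph = sd(ph, na.rm = TRUE))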

Numeric Features

df %>% 
  filter(obs_type == 'train') %>%
  select(-`brand code`, -obs_type) %>% 
  gather(variable, value, -ph) %>%
  ggplot(aes(x = value, y = ph)) +
  geom_point() +
  facet_wrap(~variable, scales = "free_x") +
  labs(x = element_blank())

We note the following about the relationship between pH and these variables:

  • air pressurer - there appear to be two sub-groups here. In checking whether this was due to brand code, we found that these sub-groups exist even within individual brands
  • alch rel - most points are gathered in 3 distinct areas; however, there do appear to be 7 outliers
  • balling, balling lvl - there appear to be two sub-groups here, which do look to be potentially associated with brand code
  • bowl setpoint - appears to potentially be a categorical variable, as the values don’t appear to be continuous. There also appear to be some outliers
  • carb flow - there appear to be two or three groups of data points. We also note the presence of outliers
  • density - there appear to be two to three groups of points. We note the presence of an outlier
  • filler speed - values appear to mostly fall within the low or high range. Values in the middle are less frequent
  • pressure setpoint - appears to take discrete values, with the exception of 4 outliers
  • pressure vacuum - appears to take discrete values with a potentially positive linear relationship. We note the presence of an outlier
  • psc co2 - appears to take discrete values. We note the presence of an outlier
  • psc fill - there appear to be 5 bands that values fall into, separated by ranges with no values. We may consider adding a categorical variable to capture this (see the binning sketch after this list)
  • carb pressure, carb pressure1, carb rel, carb volume, fill ounces, fill pressure, filler level, hyd pressure1-4, mfr, oxygen filler, pc volume, psc, temperature, usage cont - no visible relationship. We do note the presence of outliers
  • General note: It appears that many of these variables are on different scales. We’ll take care of this during our data prep phase.
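
As an example of the feature-engineering idea mentioned for psc fill, the values could be binned into bands (a hypothetical sketch; the 5 equal-width breaks are illustrative and are not cut points read off the plot):

# bin `psc fill` into 5 equal-width bands as a candidate categorical feature
df %>%
  filter(obs_type == 'train') %>%
  mutate(psc_fill_band = cut(`psc fill`, breaks = 5)) %>%
  count(psc_fill_band)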

Correlated Features

For many models, correlation between features can be an issue. Let’s see what the correlation between our variables looks like:

numeric_values <- df %>% filter(obs_type == 'train') %>% dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>% data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')

We see that many of our features are highly correlated. There are several methods we could use to address this; because we have many features, it may make sense to use principal component analysis (PCA), which will allow us to reduce the number of columns in our model and hopefully produce a simpler model.
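
Before committing to PCA, we can also list the specific columns caret flags at a given correlation cutoff (a sketch; the 0.9 cutoff is an arbitrary choice for illustration):

# columns recommended for removal due to pairwise correlations above 0.9
caret::findCorrelation(train_cor, cutoff = 0.9, names = TRUE)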

Summary EDA Notes

  • Feature distributions are skewed and may benefit from a transformation
  • Missing data will most likely not be a significant issue
  • There are several outliers in our features - we should think about using a modeling technique that is robust against outliers
  • Many of our features are significantly correlated with each other. PCA or another method may be helpful in reducing collinearity
  • There appear to be sub-populations even within brand codes. It may be helpful to do some feature engineering to tease this information from the data

Data Processing

Dummy Variables

We’ll need to transform our categorical feature brand code into dummy variables:

# build one indicator column per brand code level
dmy <- dummyVars(~`brand code`, data = df)
dmy_cols <- predict(dmy, newdata = df)

df <- df %>% 
  cbind(dmy_cols) %>%
  select(-`brand code`)
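
One design note: by default, dummyVars produces one indicator column per level, which is fine for tree-based models but perfectly collinear for linear ones. If we later fit a linear model, the encoder could instead be built with fullRank = TRUE (a sketch, shown only as an alternative to the call above, run before brand code is dropped):

# alternative encoder that drops one level to avoid perfect collinearity
dmy_full_rank <- dummyVars(~`brand code`, data = df, fullRank = TRUE)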

Data Imputation

Earlier in our analysis, we saw that there were quite a few missing values. We’ll handle them as part of a single preprocessing pipeline: a Yeo-Johnson transformation (similar to Box-Cox, but it also accepts zero and negative values), centering and scaling, imputation of missing values using the 5 nearest neighbors (KNN), and principal component analysis to remove collinearity. Note that caret’s preProcess applies these steps in its own fixed order (transformation, centering, scaling, imputation, then PCA), regardless of how the methods are listed:

df_features <- df %>% 
  select(-ph) %>% 
  data.frame()


trans <- caret::preProcess(
  df_features, 
  method = c("YeoJohnson", "center", "scale", "pca", "knnImpute"), 
  k = 5
  )

df_transformed <- predict(trans, df_features)

Let’s see if we have any missing values in our data now:

colSums(is.na(df_transformed))
## obs_type      PC1      PC2      PC3      PC4      PC5      PC6      PC7 
##        0        0        0        0        0        0        0        0 
##      PC8      PC9     PC10     PC11     PC12     PC13     PC14     PC15 
##        0        0        0        0        0        0        0        0 
##     PC16     PC17     PC18     PC19 
##        0        0        0        0

Baseline Model

We’ll now fit a baseline model that will serve as a starting point to measure all other models against. We’ve chosen a random forest because, as we noted above, there are outliers within our dataset, and tree-based methods are often robust against them. We are using the default parameters for this initial run:

# predictors: the principal components for the training rows
feature_df <- df_transformed %>% 
  filter(obs_type == 'train') %>% 
  select(-obs_type)

# response: pH for the same training rows
response_df <- df %>% 
  filter(obs_type == 'train') %>% 
  select(-obs_type)

rfModel <- randomForest(feature_df, response_df$ph)
rfModel
## 
## Call:
##  randomForest(x = feature_df, y = response_df$ph) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.01512613
##                     % Var explained: 49.16
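
For context, the mean of squared residuals above implies an out-of-bag RMSE of roughly sqrt(0.0151) ≈ 0.123 pH units. The same figure can be pulled from the fields the randomForest object stores:

# out-of-bag RMSE at the final tree
sqrt(rfModel$mse[rfModel$ntree])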