Background

The purpose of our second project is to work as a team to apply concepts from the second half of our Predictive Analytics course to a beverage data set. More specifically, we explore the data; determine whether a linear regression, non-linear regression, or tree-based model is most appropriate; build and compare candidate models; select the optimal model; and support why we made that selection.

ABC Beverage Spec

This is role playing. I am your new boss. I am in charge of production at ABC Beverage, and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report our predictive model of PH to them.

Please use the historical data set I am providing. Build your models and report the predictive factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report as a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should clearly show the models you tested and how you selected your final approach.

# read in data (as CSVs)
train <- read.csv("https://raw.githubusercontent.com/Magnus-PS/DATA-624/main/624P2TrainData.csv")
test <- read.csv("https://raw.githubusercontent.com/Magnus-PS/DATA-624/main/624P2EvalData.csv")

# rename i..Brand.Code
names(train)[names(train) == "ï..Brand.Code"] <- "Brand.Code"
names(test)[names(test) == "ï..Brand.Code"] <- "Brand.Code"

# label datasets
train$Dataset <- 'train'
test$Dataset <- 'test'

# merge datasets
final_df <- rbind(train, test)

Approach

We will explore, analyze and model a data set containing approximately 2,600 training and 300 test records representing various brands of a beverage. The variables appear to be related to beverage chemical and manufacturing properties as well as the brand of beverage created.

The response variable is PH which represents the pH level of each beverage. We’ll generate multiple models and then select the best performing model for our training data prior to casting predictions for our test data’s PH variable (which is currently all NAs).

We’ll follow the general, high-level approach that is common in Data Science, prior to casting our predictions:

  • Exploratory Data Analysis (EDA),
  • Data Preparation,
  • Model Building, and
  • Model Selection



Exploratory Data Analysis (EDA)

The purpose of exploring our data is to sharpen the questions we’re asking while building a firm understanding of the data. We aim to familiarize ourselves with the data’s structure, value ranges, proportion of missing values, distributions, outliers, predictive strength, and correlations, and then take the actions necessary to prepare our data for model building.

After this point, we should have enough insight to prepare our data and build our model(s).

Data Structure

We utilize dplyr’s glimpse() function to gain insight into dimensions, variable characteristics, and value ranges for our training data:

glimpse(train)
## Rows: 2,571
## Columns: 34
## $ Brand.Code        <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B~
## $ Carb.Volume       <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.~
## $ Fill.Ounces       <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23~
## $ PC.Volume         <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333~
## $ Carb.Pressure     <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2~
## $ Carb.Temp         <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141~
## $ PSC               <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,~
## $ PSC.Fill          <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12~
## $ PSC.CO2           <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14~
## $ Mnf.Flow          <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100~
## $ Carb.Pressure1    <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124~
## $ Fill.Pressure     <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0~
## $ Hyd.Pressure1     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure2     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure3     <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Hyd.Pressure4     <int> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86~
## $ Filler.Level      <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118~
## $ Filler.Speed      <int> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, ~
## $ Temperature       <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4~
## $ Usage.cont        <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.~
## $ Carb.Flow         <int> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3~
## $ Density           <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90~
## $ MFR               <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74~
## $ Balling           <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2~
## $ Pressure.Vacuum   <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4~
## $ PH                <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38~
## $ Oxygen.Filler     <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0~
## $ Bowl.Setpoint     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12~
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0~
## $ Air.Pressurer     <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146~
## $ Alch.Rel          <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52~
## $ Carb.Rel          <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34~
## $ Balling.Lvl       <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44~
## $ Dataset           <chr> "train", "train", "train", "train", "train", "train"~

From above, we gather that:

  • we’re dealing with training data of 33 variables (excluding our added Dataset variable) and 2,571 observations,
  • the response variable PH is of type double,
  • Brand.Code is a categorical variable,
  • the remaining predictors are of type int and double,
  • some variables may be uninformative (e.g., Hyd.Pressure1 shows all 0s in this preview),
  • there is a significant difference in scale across features (e.g., Filler.Speed v. Oxygen.Filler), and
  • missing / NA values are present.

Let’s get a high level look at our distributions:

summary(train)
##   Brand.Code         Carb.Volume     Fill.Ounces      PC.Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb.Pressure     Carb.Temp          PSC             PSC.Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC.CO2           Mnf.Flow       Carb.Pressure1  Fill.Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd.Pressure1   Hyd.Pressure2   Hyd.Pressure3   Hyd.Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler.Level    Filler.Speed   Temperature      Usage.cont      Carb.Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure.Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen.Filler     Bowl.Setpoint   Pressure.Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air.Pressurer      Alch.Rel        Carb.Rel      Balling.Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1     
##    Dataset         
##  Length:2571       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

We note the presence of negative values (e.g., Pressure.Vacuum), differences in scale and distribution, and confirm the presence of NA values (including 212 in MFR) across numerous variables.

Missing Values

Next, we investigate missing values in greater depth:

VIM::aggr(final_df %>% filter(Dataset == 'train'), col=c('green','red'), numbers=T, sortVars=T,
          cex.axis = .7,
          ylab=c("Proportion of Data", "Combinations and Percentiles"))

From the proportion chart above on the left we can see that:

  • MFR has ~8% missing values (the most of any variable),
  • a significant proportion of variables have 1-2% missing values (e.g., Filler.Speed),
  • a significant proportion of variables have less than 1% missing data (e.g., PSC.Fill), and
  • 4 variables (e.g., Brand.Code) have no missing data.

When we shift our attention to the Combinations and Percentiles chart (on the right), it appears that while there may be instances of related missingness patterns between variables (e.g., MFR and Fill.Pressure), the data is primarily missing at random.

Data Type Conversions

We’re primarily dealing with numeric variables, but in order to properly utilize our categorical variable, we convert it to type factor. We re-label empty strings as “Unlabelled” and then convert Brand.Code to a factor:

#Relabel missing brand codes to be "Unlabelled"
final_df$Brand.Code[final_df$Brand.Code == ""] <- NA
final_df$Brand.Code <- final_df$Brand.Code %>% tidyr::replace_na('Unlabelled')

#convert brand code to factor
final_df  <- final_df %>% dplyr::mutate(Brand.Code = factor(Brand.Code, levels = c('Unlabelled', 'A','B','C','D'), ordered = TRUE))

Numeric Variable Distributions

Earlier we’d noted the vast difference in ranges between numeric variables. To explore this point further and gain greater insight into the distribution of each variable, we utilize inspectdf’s inspect_num() function:

#Variable distributions
inspectdf::inspect_num(final_df %>% filter(Dataset == 'train')) %>% 
  show_plot()

We observe a range of distributions:

  • Carb.Pressure, Carb.Temp, Carb.Volume, Fill.Ounces, PC.Volume, PH, Pressure.Vacuum, PSC, PSC.Fill, and Temperature have relatively normal spreads, with just a few of these skewed enough to require centering / normalization,
  • Alch.Rel, Bowl.Setpoint, and Pressure.Setpoint behave like discrete numeric variables, with distributions concentrated about a few values, and
  • the remaining distributions are non-normal and in need of scaling and centering.

From the above, centering and scaling will prove essential prior to model-building; fortunately, R has built-in functions to center and scale automatically and en masse. There may also be a place for introducing dummy / flag variables for the distributions concentrated about a few values.
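
For illustration, such a flag might look like the following one-liner (the specific cutoff level is hypothetical):

#hypothetical flag: 1 when Alch.Rel sits at/above one of its concentrated levels
final_df$Alch.Rel.High <- as.integer(final_df$Alch.Rel >= 7)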

Categoric Contingency Table

Next, we create a contingency table for our categorical variable to better understand the relationship between Brand.Code and PH:

table(final_df$Brand.Code, final_df$PH)
##             
##              7.88 7.9 7.98  8 8.02 8.04 8.06 8.08 8.1 8.12 8.14 8.16 8.18 8.2
##   Unlabelled    0   0    0  0    0    0    0    0   0    1    0    0    1   2
##   A             0   0    0  0    0    0    1    0   1    2    1    4    3   6
##   B             0   0    0  1    1    3    4    3   0    3    1    0    5   1
##   C             1   1    1  1    2    0    1    3   4    5    7    7    4   3
##   D             0   0    0  0    0    0    0    0   0    0    0    1    0   0
##             
##              8.22 8.24 8.26 8.28 8.3 8.32 8.34 8.36 8.38 8.4 8.42 8.44 8.46
##   Unlabelled    4    4    8    5   5    2    2    0    0   1    2    1    7
##   A             6    8    6    4   6    3    4    5    3  10   13   10   24
##   B             6    7   12   13  13   11   13   37   42  54   47   47   56
##   C             3    4   11    9  19    6    8   12   12  16   15   14   16
##   D             0    1    0    2   1    3    2   12   10  28   17   17   25
##             
##              8.48 8.5 8.52 8.54 8.56 8.58 8.6 8.62 8.64 8.66 8.68 8.7 8.72 8.74
##   Unlabelled    7   8   10    9    6    7   5    1    4    2    1   1    2    1
##   A            13  11    6   18   13   16  29   17   10    6    7   3    3    4
##   B            48  53   47   74   38   51  45   27   37   24   37  68   58   55
##   C            13  18   10   13   17    9  13    6    2    1    2   4    3    1
##   D            20  36   20   24   26   26  35   34   30   37   30  38   34   20
##             
##              8.76 8.78 8.8 8.82 8.84 8.86 8.88 8.9 8.92 8.94 9.36
##   Unlabelled    2    3   0    0    2    2    2   0    0    0    0
##   A             6    4   1    2    2    2    0   0    0    0    0
##   B            44   33  38   27   18   15   12   2    2    2    0
##   C             0    4   2    0    0    0    0   0    0    0    1
##   D            15   18  18   13    9    5    2   1    5    0    0

From the above we might infer that:

  • B is the most popular beverage brand,
  • the distributions are quite sparse outside of the 8.3 through 8.8 range, and
  • the brand of beverage appears to have some correlation with PH level.

Boxplots

With a basic understanding of the distribution of each of our features, we turn to boxplots of each numeric variable in order to visualize the magnitude of outliers:

# Utilize customized box plot generation function
df_train <- final_df %>% filter(Dataset == 'train')
boxplot_depend_vs_independ(df_train, df_train$PH)
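
This helper is custom-built and not shown in the output; a minimal sketch of the idea, assuming ggplot2 and tidyr (the response argument is kept only to mirror the call above), might look like:

#one boxplot per numeric predictor; free scales since ranges differ widely
boxplot_depend_vs_independ <- function(df, response) {
  df %>%
    dplyr::select(where(is.numeric)) %>%
    tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "value") %>%
    ggplot2::ggplot(ggplot2::aes(x = "", y = value)) +
    ggplot2::geom_boxplot(outlier.colour = "red", outlier.size = 0.8, na.rm = TRUE) +
    ggplot2::facet_wrap(~ variable, scales = "free_y") +
    ggplot2::labs(x = NULL, y = NULL)
}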

What we gather from the output is:

  • our response variable (PH) has an interquartile range of roughly 8.4-8.7, with outlier values > 9.0 and ~8.0,
  • Mnf.Flow, Hyd.Pressure2, Hyd.Pressure3, Usage.cont, Carb.Flow, Density, Balling, Bowl.Setpoint, Pressure.Setpoint, and Balling.Lvl have no outliers,
  • Carb.Volume, Carb.Pressure, PSC.CO2, Hyd.Pressure1, Filler.Level, Pressure.Vacuum, and Alch.Rel have minimal outliers, and
  • the remaining (15) variables have a significant presence of outliers.

As such, either outlier-handling will be essential to properly predicting our continuous, numeric response variable (PH), or we’re dealing with non-linear relationships. Given the sheer prevalence of “outliers”, we feel the latter is more likely than not.

Correlation Matrix

Having reviewed the structure, distributions, and presence of outliers for our variables, we turn our attention to the relationships these variables have with one another via a correlation matrix. We consider only variables with a correlation of magnitude > 0.1 in our plot:

#Utilize custom-built correlation matrix generation function
plot_corr_matrix(final_df %>% filter(Dataset == 'train'), -1)
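
This helper is likewise custom-built; a rough equivalent, assuming the corrplot package and reading the second argument as a display threshold (here -1, i.e. show everything), might be:

#pairwise correlations on complete pairs; weak correlations blanked below the threshold
plot_corr_matrix <- function(df, threshold = -1) {
  corr <- cor(dplyr::select(df, where(is.numeric)), use = "pairwise.complete.obs")
  corr[abs(corr) < threshold] <- 0
  corrplot::corrplot(corr, method = "color", type = "lower", tl.cex = 0.6)
}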

It does not appear that multicollinearity is a concern, and so we should not need to exclude features on that basis.

Variable Importance

As a final step in the exploration of our data, we aim to understand which variables have the strongest v. weakest relationship with PH level. This information may prove useful later during feature selection / exclusion:

# Perform Boruta search
boruta_output <- Boruta(PH ~ ., data=na.omit(train), doTrace=0, maxRuns = 1000)

# Get significant variables including tentatives
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)

#print(boruta_signif)
# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

Our Boruta output indicates that 4 of our features may be insignificant (PSC.CO2, PSC, PSC.Fill, and Carb.Temp) while the remainder appear significant, with Mnf.Flow carrying the greatest variable importance (by a narrow margin).

EDA Summary

The training dataset has 33 variables and 2,571 observations. Several remediation actions became apparent over the course of our exploratory data analysis:

  • There are numerous features with a weak relationship to PH. While on the surface we might consider dropping these variables (e.g., PSC.CO2), we should first see whether their importance increases once we consider other features during our regression / model-building.
  • Multicollinearity does not appear to be a concern.
  • There are outliers in the majority of our features. Either these will have to be addressed or the relationship between response and predictors is nonlinear.
  • Imputing missing values will be an important factor in our modeling. From our analysis, KNN imputation may be a logical approach, since beverages of similar composition (and hence similar PH) should have similar values.
  • Centering and scaling will likely be essential prior to model-building and especially depending on the approach we elect for predicting our continuous, numeric response variable (PH).

The recommendations above provide a “starting line” for our data preparation. Given that the majority of our features appear to have a relatively strong relationship with PH, we anticipate being able to pull considerable signal from the variables at hand and develop a strong model.


Data Prep

With insights from exploratory data analysis in hand, we set out to “optimize” our data for modeling. We’ll impute missing values, normalize our data, temper skew and outliers via Box-Cox transformation, and then proceed to model building.

We utilize the preProcess function from the caret package to center, scale, impute, and apply a Box-Cox transformation, optimizing the data set for the linear and nonlinear regression models we’re going to build. Normalization matters for linear / non-linear regression (and neural network) models, whose fits are sensitive to the scale and skew of the inputs. Tree-based models such as Random Forest, by contrast, are insensitive to monotone transformations of the predictors and do not need normalized data. Being that we’re going to consider a few different linear and nonlinear regression models in addition to tree-based models, we elected to create two data sets: one normalized and one not.

For normalization, we drop the Brand.Code, Dataset, and PH columns from consideration and utilize the preProcess() function to center, scale, impute via kNN imputation, and temper outliers via Box-Cox transformation. Additionally, we ensure that we proceed with only complete PH values and then train-test split our training data so that we can verify performance before later casting predictions on the evaluation set.
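
The construction of df_train2 was omitted from the knitted output; given the description above, a minimal reconstruction would be:

#training predictors only: drop the brand label, the dataset marker, and the response
df_train2 <- final_df %>%
  dplyr::filter(Dataset == 'train') %>%
  dplyr::select(-Brand.Code, -Dataset, -PH)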

set.seed(150)

#center, scale, and impute
Processed <- preProcess(df_train2, method = c("knnImpute", "center", "scale", "BoxCox"))

#restore the raw PH values and drop rows where PH is missing for the training set
complete_ph <- predict(Processed, df_train2)
complete_ph$PH = df_train$PH
complete_ph <- complete_ph %>% na.omit()
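
The train-test split itself was also omitted from the output; the X/Y objects used by the models below could be produced along these lines (an 80/20 split is assumed):

#80/20 split of the processed training data into the objects used below
idx <- caret::createDataPartition(complete_ph$PH, p = 0.8, list = FALSE)
X.train.cs <- complete_ph[idx, ] %>% dplyr::select(-PH)
Y.train.cs <- complete_ph$PH[idx]
X.test.cs  <- complete_ph[-idx, ] %>% dplyr::select(-PH)
Y.test.cs  <- complete_ph$PH[-idx]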

Model Building

In this section we’re going to consider linear, non-linear, and tree-based models to select first the strongest model within each sub-group and then the overall top-performing model.

Our EDA led us to believe that we’re dealing with non-linear data where the heavy presence of outliers may create problems for our linear and non-linear regression models. We anticipate the linear models will perform worst, given the apparent non-linearity of the relationships and the magnitude of outliers, while the non-linear models may also struggle somewhat with the skew those outliers introduce.

From this, we believe a tree-based model might be best, but we will consider linear and non-linear regression models for the sake of consistency and in order to properly rule them out.

Linear Regression

With our data prepared, we consider a number of linear regression models: multiple linear regression, stepwise AIC-optimized regression, and partial least squares. We utilize the same train() function for all three models, feeding the same data sets for X and Y, and specifying the model-building technique via the “method” argument:

Multiple linear regression

linear_model <- train(X.train.cs, Y.train.cs,
                      method='lm',
                      trControl=trainControl(method = "cv"))

AIC optimized
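
The chunk for this model was omitted from the knitted output; since aic_model is referenced below, a minimal sketch assuming caret’s lmStepAIC method (MASS::stepAIC under the hood) might be:

aic_model <- train(X.train.cs, Y.train.cs,
                      method='lmStepAIC',
                      trace=0,        #suppress step-by-step output
                      trControl=trainControl(method = "cv"))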

Partial Least Squares

pls_model <- train(X.train.cs, Y.train.cs,
                      method='pls',
                      metric='Rsquared',
                      tuneLength=10,
                      trControl=trainControl(method = "cv"))

With all three linear models built, we proceed to casting our predictions and verifying performance against the test portion of our train-test split. Recall: it’s all been derived from our training set.

set.seed(150)

lmPred <- predict(linear_model, newdata = X.test.cs)
lmResample <- postResample(pred=lmPred, obs = Y.test.cs)

aicPred <-predict(aic_model, newdata=X.test.cs)
aicResample <- postResample(pred=aicPred, obs=Y.test.cs)

plsPred <-predict(pls_model, newdata=X.test.cs)
plsReSample <- postResample(pred=plsPred, obs = Y.test.cs)

The purpose of this section is to verify model performance and identify the strongest performing model in our linear regression subset. Based on what we observed during EDA (exploratory data analysis), we did not anticipate the linear regression models performing well:

display <- rbind(
"Linear Regression" = lmResample,
"Stepwise AIC" = aicResample,
"Partial Least Squares" = plsReSample
)

display %>% kable() %>% kable_paper()
                          RMSE       Rsquared   MAE
Linear Regression         0.1438062  0.3493612  0.1103118
Stepwise AIC              0.1439355  0.3482182  0.1100548
Partial Least Squares     0.1437415  0.3499468  0.1102135

Using Rsquared and MAE as the principal criteria for selecting the best model, the three models are nearly indistinguishable: Partial Least Squares posts the best Rsquared (and RMSE), while Stepwise AIC posts the best MAE. We carry Partial Least Squares forward as our strongest linear model. In any case, the Rsquared value is very low: only ~35% of the variation in the output variable can be explained by the input variables. That all of the linear models produced similarly weak results suggests that the relationship between our target variable and the predictors is non-linear.

As a final measure we investigate the top ten influential predictors:
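
The importance plot chunk isn’t shown in the output; presumably something along the lines of caret’s varImp was used:

#top 10 predictors by importance for the linear model
plot(varImp(linear_model), top = 10)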

We see that Mnf.Flow, Temperature, Carb.Pressure1, Usage.cont, Hyd.Pressure2, and Filler.Level contribute the most to the pH levels according to the linear regression model. The model places heavy emphasis on Mnf.Flow and roughly equal importance on Temperature and Carb.Pressure1.


Non-Linear Regression

As a natural next step, we explore the efficacy of non-linear models. We consider Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) models, and anticipate that they will outperform their linear counterparts for the simple reason that we appear to be dealing with non-linear variables and non-linear relationships between them.

We utilize the train() function for all three models, feeding the same data sets for X and Y, tuning the grid where applicable (e.g., MARS), and specifying the model-building technique via the “method” argument:

MARS

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
marsM <- train(X.train.cs, Y.train.cs,
                    method = "earth",
                    tuneGrid = marsGrid,
                    trControl = trainControl(method = "cv"))

Support Vector Machine

supModel <- train(X.train.cs, Y.train.cs,
 method = "svmRadial",
 tuneLength = 14,
 trControl = trainControl(method = "cv"))

KNN

knnModel <- train(x = X.train.cs,
                  y = Y.train.cs,
                  method = "knn",
                  tuneLength = 10)

With all three non-linear regression models built, we proceed to casting our predictions and verifying performance against the test portion of our train-test split. Recall: it’s all been derived from our training set.

set.seed(150)

marsPred <- predict(marsM, newdata = X.test.cs)
marsResample <- postResample(pred=marsPred, obs = Y.test.cs)

svmPred <-predict(supModel, newdata=X.test.cs)
svmResample <- postResample(pred=svmPred, obs=Y.test.cs)

knnPred <-predict(knnModel, newdata=X.test.cs)
knnResample <- postResample(pred=knnPred, obs=Y.test.cs)

The purpose of this section is to verify model performance and identify the strongest performing model in our non-linear regression subset. Based on what we observed during EDA, we anticipate our non-linear models performing better than our linear regression models, although they may still struggle with the high magnitude of outliers we noted:

display <- rbind(
  "MARS" = marsResample,
  "Support Vector Machine" = svmResample,
  "KNN" = knnResample)

display %>% kable() %>% kable_paper()
                          RMSE       Rsquared   MAE
MARS                      0.1366654  0.4136367  0.1036735
Support Vector Machine    0.1280496  0.4854961  0.0921707
KNN                       0.1362174  0.4216192  0.1019014

Using Rsquared and MAE as the principal criteria for selecting the best model, the Support Vector Machine (SVM) performed best. However, the R-squared value is still below 0.50, meaning we’re explaining less than half of the variation in the output variable with our input variables. This leaves something to be desired, and so we look to the next section (tree-based models) with optimism.

As a final measure we investigate the top ten influential predictors:
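
Again the plot chunk isn’t shown; a caret-based equivalent for the SVM model would be (note that caret falls back to a model-free importance measure for SVMs):

#top 10 predictors by importance for the SVM model
plot(varImp(supModel), top = 10)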

The non-linear model places high importance on Mnf.Flow, just as the linear model did, but there is a difference in the remaining variables: Usage.cont and Filler.Level are the second- and third-strongest contributors.

Tree-Based

The tree-based models will use non-normalized data in which outliers have not been handled, since it’s within their capacity to deal with both. We therefore perform a lighter round of pre-processing than for the regression models: removing highly correlated and near-zero-variance predictors, kNN imputation for missing values, ensuring non-null PH observations, and train-test splitting our data:
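
The pre-processing chunk itself was omitted from the output; a hedged reconstruction (kNN imputation on the original scale via VIM, caret filters for near-zero variance and high correlation, an assumed 80/20 split, and Brand.Code set aside for simplicity) might look like:

set.seed(150)

#keep training rows with an observed PH; impute on the original scale
df_tree <- final_df %>%
  dplyr::filter(Dataset == 'train', !is.na(PH)) %>%
  dplyr::select(-Dataset, -Brand.Code) %>%
  VIM::kNN(imp_var = FALSE) %>%
  as.data.frame()

#drop near-zero-variance and highly correlated predictors
preds <- df_tree %>% dplyr::select(-PH)
nzv <- caret::nearZeroVar(preds)
if (length(nzv) > 0) preds <- preds[, -nzv]
high_cor <- caret::findCorrelation(cor(preds), cutoff = 0.9)
if (length(high_cor) > 0) preds <- preds[, -high_cor]

#80/20 train-test split into the objects used below
idx <- caret::createDataPartition(df_tree$PH, p = 0.8, list = FALSE)
X.train <- preds[idx, ];  Y.train <- df_tree$PH[idx]
X.test  <- preds[-idx, ]; Y.test  <- df_tree$PH[-idx]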

Once our data’s been brought into proper form, we consider Random Forest, Cubist, and Gradient Boosting models in order to compare and contrast which may be the best tree-based model. As mentioned earlier, we had a hunch that a tree-based model would be our strongest performer, given the high magnitude of outliers and the apparently non-linear data.

Random forest

rfModel <- randomForest(X.train, Y.train, importance=TRUE, ntree = 500)

Cubist

set.seed(100)
cubistTune <- train(X.train, Y.train,
    method = "cubist",
    verbose = FALSE)

Gradient boosting

gbmGrid <- expand.grid(
         interaction.depth = seq(1, 7, by = 2),
         n.trees = seq(100, 1000, by = 50),
         shrinkage = c(0.01, 0.1),
         n.minobsinnode = 10
         )
set.seed(100)
gbmTune <- train(X.train, Y.train,
    method = "gbm",
    tuneGrid = gbmGrid,
    verbose = FALSE)

With all three tree-based models built, we proceed to casting our predictions and verifying performance against the test portion of our train-test split. Recall: it’s all been derived from our training set.

set.seed(150)

randomPred <- predict(rfModel, newdata = X.test)
randomResample <- postResample(pred=randomPred, obs = Y.test)

gbmPred <-predict(gbmTune, newdata=X.test)
gbmResample <- postResample(pred=gbmPred, obs=Y.test)

cubistPred <-predict(cubistTune, newdata=X.test)
cubistSample <- postResample(pred=cubistPred, obs = Y.test)

The purpose of this section is to verify model performance and identify the strongest performing model in our tree-based subset. Based on what we observed during EDA, we anticipate these models performing quite well:

display <- rbind(
  "Random Forest" = randomResample,
  "Gradient Boosted Tree" = gbmResample,
  "Cubist" = cubistSample)

display %>% kable() %>% kable_paper()
                          RMSE       Rsquared   MAE
Random Forest             0.1120980  0.6268775  0.0810325
Gradient Boosted Tree     0.1164534  0.5745456  0.0859115
Cubist                    0.1032232  0.6655173  0.0753867

Using Rsquared and MAE as the principal criteria for selecting the best model, the Cubist model edges the others, narrowly overtaking the Random Forest model. All of our tree-based models outperformed the linear and non-linear models, bearing out our hypothesis.

The R-squared for our Cubist model indicates it explains over 66% of the variance in our data. This is close to the range at which we’d consider it a great model (R-squared between 0.70 and 0.90), and so we feel relatively comfortable proceeding with the Cubist model as our optimal predictive model.

As a final measure we investigate the top ten influential predictors:

plot(varImp(cubistTune), top = 10)

This model places highest importance on Mnf.Flow and Alch.Rel, followed by Density, Temperature, and so on down the line. The plot highlights which variables are most important to our model and thus to predicting pH levels.

From a chemical standpoint, it makes sense that the model would place high importance on Alch.Rel and Carb.Rel when predicting PH levels, since alcohol content and dissolved CO2 (which forms carbonic acid) directly shift a beverage’s acid-base balance.

Model Selection & Forecast

We considered nine models in all: three linear, three non-linear, and three tree-based. We’d headed into our model building with the hypothesis that tree-based models would outperform non-linear models, which would in turn outperform linear models, and that’s how it played out.

Model Selection

The data we were to process had a relatively high magnitude of outliers in addition to non-linear relationships between variables, and it was for this reason that we felt one of the tree-based models would be our best. While the Random Forest model deserves honorable mention for strong predictive performance and easy setup, the Cubist model outperformed it on both R-squared and mean absolute error (MAE).

When we compare the strongest linear v. non-linear v. tree-based models, we see:

display <- rbind(
  "Partial Least Squares" = plsReSample,
  "Support Vector Machine" = svmResample,
  "Cubist" = cubistSample
)

display %>% kable() %>% kable_paper()
                          RMSE       Rsquared   MAE
Partial Least Squares     0.1437415  0.3499468  0.1102135
Support Vector Machine    0.1280496  0.4854961  0.0921707
Cubist                    0.1032232  0.6655173  0.0753867

The Cubist (tree-based) model by far outperforms the strongest linear and nonlinear models.

We selected the Cubist model first based on performance and second based on ease-of-use (setup). The Cubist model performed the best on test data with regard to R-squared and mean absolute error (MAE). Additionally, it required less data pre-processing than our linear and nonlinear regression models and provided our best accuracy “out of the box” (no variable tweaking).

Cubist is a powerful, rule-based model that balances the call for predictive accuracy with that of intelligibility, and it’s for this reason we’re confident in its selection and capability.

Forecast

We proceed to predicting PH values and returning the provided test set with the NA values replaced by our Cubist model’s predictions:

set.seed(150)

#Prep data set: select test then drop dataset variable
final_test <- final_df %>% 
  filter(Dataset == 'test') %>% 
  dplyr::select(-Dataset)

#Cast predictions using our final Cubist model
predict <- round(predict(cubistTune, newdata=final_test),2)

#verify that we've cast predictions
#head(predict)
#summary(predict)

#send to Excel / CSV file
final_test$PH <- predict
head(final_test)
##   Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure Carb.Temp   PSC
## 1          D    5.480000    24.03333 0.2700000          65.4     134.6 0.236
## 2          A    5.393333    23.95333 0.2266667          63.2     135.0 0.042
## 3          B    5.293333    23.92000 0.3033333          66.4     140.4 0.068
## 4          B    5.266667    23.94000 0.1860000          64.8     139.0 0.004
## 5          B    5.406667    24.20000 0.1600000          69.4     142.2 0.040
## 6          B    5.286667    24.10667 0.2120000          73.4     147.2 0.078
##   PSC.Fill PSC.CO2 Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## 1     0.40    0.04     -100          116.6          46.0             0
## 2     0.22    0.08     -100          118.8          46.2             0
## 3     0.10    0.02     -100          120.2          45.8             0
## 4     0.20    0.02     -100          124.8          40.0             0
## 5     0.30    0.06     -100          115.0          51.4             0
## 6     0.22      NA     -100          118.6          46.4             0
##   Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## 1            NA            NA            96        129.4         3986
## 2             0             0           112        120.0         4012
## 3             0             0            98        119.4         4010
## 4             0             0           132        120.2           NA
## 5             0             0            94        116.0         4018
## 6             0             0            94        120.4         4010
##   Temperature Usage.cont Carb.Flow Density   MFR Balling Pressure.Vacuum   PH
## 1        66.0      21.66      2950    0.88 727.6   1.398            -3.8 9.02
## 2        65.6      17.60      2916    1.50 735.8   2.942            -4.4 9.02
## 3        65.6      24.18      3056    0.90 734.8   1.448            -4.2 9.02
## 4        74.4      18.12        28    0.74    NA   1.056            -4.0 9.18
## 5        66.4      21.32      3214    0.88 752.0   1.398            -4.0 9.02
## 6        66.6      18.00      3064    0.84 732.0   1.298            -3.8 9.02
##   Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel Carb.Rel
## 1         0.022           130              45.2         142.6     6.56     5.34
## 2         0.030           120              46.0         147.2     7.14     5.58
## 3         0.046           120              46.0         146.6     6.52     5.34
## 4            NA           120              46.0         146.4     6.48     5.50
## 5         0.082           120              50.0         145.8     6.50     5.38
## 6         0.064           120              46.0         146.0     6.50     5.42
##   Balling.Lvl
## 1        1.48
## 2        3.04
## 3        1.46
## 4        1.48
## 5        1.46
## 6        1.44
#write to csv
write.csv(final_test,"C:/Users/magnu/Documents/DATA-624/DATA624Proj2_PH_prediction.csv", row.names = FALSE)

We can see the first 6 predictions above, with the PH predictions present in the 26th column. We’ve provided PH values in the same magnitude and value format as the training data.

Variable Importance Commentary

The processes with the greatest impact on pH, based on our selected model and its overlap with the earlier Boruta results, are: Mnf.Flow, Alch.Rel, Carb.Rel, and Temperature. Density was less prominent in our general exploration of variable importance, but it mattered to our model. Putting an emphasis on these processes in beverage production would have the largest perceived impact on the pH levels of our beverages.

For example, pH decreases as temperature increases, so adjusting this one process (all else held constant) may reduce the pH level of a beverage. The more powerful, and more complex, measure would be the interplay of these variables and their impact on one another.

As a final measure, we consider visualizations of the relationship between our top predictors and PH:

p1 <- ggplot(train, aes(Mnf.Flow, PH)) + geom_point()
p2 <- ggplot(train, aes(Alch.Rel, PH)) + geom_point()
p3 <- ggplot(train, aes(Carb.Rel, PH)) + geom_point()
p4 <- ggplot(train, aes(Temperature, PH)) + geom_point()

grid.arrange(p1, p2, p3, p4, nrow = 2)

We observe that our data are concentrated between pH values of 8.0 and 9.0, with Mnf.Flow and Alch.Rel showing clustering along certain bands / levels.

Closing Remark

While a 1:1 relationship between top predictors and PH is not clear (as highlighted in the visualizations above), what is clear is that these are the most important variables for controlling pH levels, that they are heavily concentrated within certain ranges (e.g., Temperature between 64 and 68 degrees F), and that all beverage brands sit at a pH level greater than 8, making these alkaline beverages.

Given that the relationships between predictors and response are not easily understood, using a more complex algorithm like Cubist is warranted and favorable. We selected the Cubist model for its balance of predictive prowess, interpretability, and ease of use. It was our strongest performing model and the one we felt most confident in, as highlighted in this closing section.