Technical Report: Determinants of PH in Beverage Process

Alexander Ng, Philip Tanofsky

Due 12/13/2021

Introduction

This technical report tunes multiple models for prediction of beverage PH and identifies determinants of PH among predictors in the manufacturing process. We train and test 5 models for this purpose including the Cubist, Gradient Boosted Tree, CART, MARS and linear regression models.

The training and test performance are evaluated separately on the StudentData dataset with the metrics RMSE and \(R^2\) to determine the model algorithm most appropriate for predicting the PH of the StudentEvaluation dataset.

Data Wrangling Strategy

We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:

  1. Add a primary key id to allow unique identification of all raw observations.
  2. Drop observations with NA in the response variable PH.
  3. Drop observations with a high number of missing values. In our case, observations that have more than 4 missing values.
  4. Rename the variables to remove spaces.
  5. Split the initial data set using stratified sampling based on PH value into a 80/20 train and validate data set. We recognize that a test data set exists but this is for external assessment of the project prediction accuracy.
  6. Identify and drop near zero variance predictors in the training set. And apply the same predictor removals to the validation data set (regardless of their values in the latter).
  7. Identify and fix zeros and outliers in the predictors in the training set as follows:
  1. Handle special cases of variables:
  1. Apply the knn imputation strategy to Box-Cox transformed, scaled, centered observations for all training set points.

Along the way, we will assess the numerical impact of such changes to the dataset.

## [1] 2571   33

Let’s evaluate the data characteristics of the raw file after loading. We use the skimr library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MFR 212 0.918 704.049 73.898 31.400 706.300 724.000 731.000 868.600 ▁▁▁▂▇
Filler Speed 57 0.978 3687.199 770.820 998.000 3888.000 3982.000 3998.000 4030.000 ▁▁▁▁▇
PC Volume 39 0.985 0.277 0.061 0.079 0.239 0.271 0.312 0.478 ▁▃▇▂▁
PSC CO2 39 0.985 0.056 0.043 0.000 0.020 0.040 0.080 0.240 ▇▅▂▁▁
Fill Ounces 38 0.985 23.975 0.088 23.633 23.920 23.973 24.027 24.320 ▁▂▇▂▁
PSC 33 0.987 0.085 0.049 0.002 0.048 0.076 0.112 0.270 ▆▇▃▁▁
Carb Pressure1 32 0.988 122.586 4.743 105.600 119.000 123.200 125.400 140.200 ▁▃▇▂▁
Hyd Pressure4 30 0.988 96.289 13.123 52.000 86.000 96.000 102.000 142.000 ▁▃▇▂▁
Carb Pressure 27 0.989 68.190 3.538 57.000 65.600 68.200 70.600 79.400 ▁▅▇▃▁
Carb Temp 26 0.990 141.095 4.037 128.600 138.400 140.800 143.800 154.000 ▁▅▇▃▁
PSC Fill 23 0.991 0.195 0.118 0.000 0.100 0.180 0.260 0.620 ▆▇▃▁▁
Fill Pressure 22 0.991 47.922 3.178 34.600 46.000 46.400 50.000 60.400 ▁▁▇▂▁
Filler Level 20 0.992 109.252 15.698 55.800 98.300 118.400 120.000 161.200 ▁▃▅▇▁
Hyd Pressure2 15 0.994 20.961 16.386 0.000 0.000 28.600 34.600 59.400 ▇▂▇▅▁
Hyd Pressure3 15 0.994 20.458 15.976 -1.200 0.000 27.600 33.400 50.000 ▇▁▃▇▁
Temperature 14 0.995 65.968 1.383 63.600 65.200 65.600 66.400 76.200 ▇▃▁▁▁
Oxygen Filler 12 0.995 0.047 0.047 0.002 0.022 0.033 0.060 0.400 ▇▁▁▁▁
Pressure Setpoint 12 0.995 47.615 2.039 44.000 46.000 46.000 50.000 52.000 ▁▇▁▆▁
Hyd Pressure1 11 0.996 12.438 12.433 -0.800 0.000 11.400 20.200 58.000 ▇▅▂▁▁
Carb Volume 10 0.996 5.370 0.106 5.040 5.293 5.347 5.453 5.700 ▁▆▇▅▁
Carb Rel 10 0.996 5.437 0.129 4.960 5.340 5.400 5.540 6.060 ▁▇▇▂▁
Alch Rel 9 0.996 6.897 0.505 5.280 6.540 6.560 7.240 8.620 ▁▇▂▃▁
Usage cont 5 0.998 20.993 2.978 12.080 18.360 21.790 23.755 25.900 ▁▃▅▃▇
PH 4 0.998 8.546 0.173 7.880 8.440 8.540 8.680 9.360 ▁▅▇▂▁
Mnf Flow 2 0.999 24.569 119.481 -100.200 -100.000 65.200 140.800 229.400 ▇▁▁▇▂
Carb Flow 2 0.999 2468.354 1073.696 26.000 1144.000 3028.000 3186.000 5104.000 ▂▅▆▇▁
Bowl Setpoint 2 0.999 109.327 15.303 70.000 100.000 120.000 120.000 140.000 ▁▂▃▇▁
Density 1 1.000 1.174 0.378 0.240 0.900 0.980 1.620 1.920 ▁▅▇▂▆
Balling 1 1.000 2.198 0.931 -0.170 1.496 1.648 3.292 4.012 ▁▇▇▁▇
Balling Lvl 1 1.000 2.050 0.870 0.000 1.380 1.480 3.140 3.660 ▁▇▂▁▆
Pressure Vacuum 0 1.000 -5.216 0.570 -6.600 -5.600 -5.400 -5.000 -3.600 ▂▇▆▂▁
Air Pressurer 0 1.000 142.834 1.212 140.800 142.200 142.600 143.000 148.200 ▅▇▁▁▁

Next, we display the one character variable Brand below and its skimr output below.

Character Variables in Beverage Data

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.953 1 1 0 4 0

After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH.

General observations about the skimr results:

Our first transformation below is to add a unique identifier id based on row number to obtain a primary key.
Then we drop the 4 observations which are missing PH values, rearrange the response variable PH and BrandCode for our convenience. Lastly, we also eliminate spaces from the predictor column names to avoid using quote in our code.

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)
## [1] 2567   34

Rows with Missing Data

Is there any observations with a high number of missing observations?

# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)
## [1] 529  28

We find there are 529 observations with NA observations. So 20.6% have NA values. We conclude that the NA values are sufficiently well distributed that removal of all observations with NA values is impractical.

In the marginal table below, we investigate if certain observations have a high incidence of NA values amongst their predictors. The answer shows that the incident of multiple

Number of Rows with NA

num_na count
1 345
2 119
3 45
4 13
5 3
6 2
7 1
8 1
## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"

Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio suggesting independence among the occurrence of missing values.

Evaluation of Missing Data in the Test Set

## [1] 267  33
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MFR 31 0.884 697.801 96.400 15.600 707.000 724.600 731.450 784.800 ▁▁▁▁▇
Filler Speed 10 0.963 3581.385 911.187 1006.000 3812.000 3978.000 3996.000 4020.000 ▁▁▁▁▇
Fill Ounces 6 0.978 23.970 0.076 23.747 23.920 23.967 24.013 24.200 ▁▅▇▃▁
PSC 5 0.981 0.085 0.053 0.004 0.044 0.076 0.112 0.246 ▆▇▃▂▁
PSC CO2 5 0.981 0.051 0.038 0.000 0.020 0.040 0.060 0.240 ▇▃▂▁▁
PC Volume 4 0.985 0.278 0.063 0.099 0.233 0.275 0.322 0.464 ▁▆▇▅▁
Carb Pressure1 4 0.985 123.040 4.420 113.000 120.200 123.400 125.500 136.000 ▃▃▇▂▁
Hyd Pressure4 4 0.985 97.840 13.921 68.000 90.000 98.000 104.000 140.000 ▅▆▇▂▁
PSC Fill 3 0.989 0.190 0.113 0.020 0.100 0.180 0.260 0.620 ▇▇▃▁▁
Oxygen Filler 3 0.989 0.047 0.050 0.002 0.020 0.034 0.054 0.398 ▇▁▁▁▁
Alch Rel 3 0.989 6.907 0.498 6.400 6.540 6.580 7.180 7.820 ▇▁▂▁▃
Fill Pressure 2 0.993 48.141 3.440 37.800 46.000 47.800 50.200 60.200 ▁▇▇▂▁
Filler Level 2 0.993 110.292 15.502 69.200 100.600 118.600 120.200 153.200 ▂▃▇▇▁
Temperature 2 0.993 66.227 1.692 63.800 65.400 65.800 66.600 75.400 ▇▅▁▁▁
Usage cont 2 0.993 20.897 2.999 12.900 18.120 21.440 23.740 24.600 ▁▃▃▃▇
Pressure Setpoint 2 0.993 47.733 2.057 44.000 46.000 46.000 50.000 52.000 ▁▇▁▆▁
Carb Rel 2 0.993 5.440 0.127 5.180 5.340 5.400 5.560 5.740 ▂▇▂▃▂
Carb Volume 1 0.996 5.369 0.111 5.147 5.287 5.340 5.465 5.667 ▂▇▃▅▁
Carb Temp 1 0.996 141.226 4.296 130.000 138.400 140.800 143.800 154.000 ▁▆▇▃▁
Hyd Pressure2 1 0.996 20.113 17.214 -50.000 0.000 26.800 34.800 61.400 ▁▁▆▇▁
Hyd Pressure3 1 0.996 19.609 16.557 -50.000 0.000 27.700 33.000 49.200 ▁▁▆▃▇
Density 1 0.996 1.177 0.382 0.060 0.920 0.980 1.600 1.840 ▁▁▇▁▅
Balling 1 0.996 2.203 0.925 0.902 1.498 1.648 3.242 3.788 ▅▇▁▂▅
Pressure Vacuum 1 0.996 -5.174 0.582 -6.400 -5.600 -5.200 -4.800 -3.600 ▁▇▆▃▁
Bowl Setpoint 1 0.996 109.624 15.017 70.000 100.000 120.000 120.000 130.000 ▁▂▁▃▇
Air Pressurer 1 0.996 142.831 1.231 141.200 142.200 142.600 142.800 147.200 ▅▇▁▁▁
Carb Pressure 0 1.000 68.255 3.857 60.200 65.300 68.000 70.600 77.600 ▃▆▇▃▂
Mnf Flow 0 1.000 21.034 117.756 -100.200 -100.000 0.200 141.300 220.400 ▇▁▁▆▂
Hyd Pressure1 0 1.000 12.010 13.531 -50.000 0.000 10.400 20.400 50.000 ▁▁▇▆▂
Carb Flow 0 1.000 2408.644 1161.364 0.000 1083.000 3038.000 3215.000 3858.000 ▂▃▁▆▇
Balling Lvl 0 1.000 2.051 0.883 0.000 1.380 1.480 3.080 3.420 ▁▃▇▁▇

Character Variables in Beverage Data

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 8 0.97 1 1 0 4 0
## [1] 267  34
# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)
## [1] 67 28

Number of Rows with NA

num_na count
1 50
2 11
3 3
4 1
8 1
14 1
## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"

We conclude that the distribution of missing values in the test set is:

Exploratory Data Analysis

Scatterplot of Predictor Variables

The scatterplots of the predictor variables against the dependent variable PH indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.

Density Plots

The density plots of each predictor variable show many predictor variables with non-normal distributions.

Normality Test

The Shapiro-Wilk test of normality is performed on all the numeric predictor variables and the resulting list confirms no p-value above 0.05 and thus no predictor variable follows a normal distribution per evaluation.

## [1] 0

Distribution by Brand Code

The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.

Boxplot of PH by Brand Code

The boxplot of PH values by Brand Code certainly captures overlap of the middle two quartiles, which seems reasonable given the PH range is rather narrow. Of note, the median values of the 4 labeled brands and the unlabeled group do appear distinct besides Brand A and the unlabeled.

VSURF

To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.

## Thresholding step
## Estimated computational time (on one core): 12.3 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |======================================================================| 100%
## Interpretation step (on 26 variables)
## Estimated computational time (on one core): between 799.5 sec. and  2356.1 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%
## Prediction step (on 17 variables)
## Maximum estimated computational time (on one core): 970.2 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |======================================================                |  76%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |======================================================================| 100%
## ** VSURF results **
## The results object is a list of length around 20.
## Most interesting components are the following:
## 
##   name                description                                       
## 1 "$varselect.thres"  "variables selected after thesholding step"       
## 2 "$varselect.interp" "variables selected after interpretation step"    
## 3 "$varselect.pred"   "variables selected after prediction step"        
## 4 "$ord.imp$x"        "mean VI in decreasing order for all variables"   
## 5 "$ord.imp$ix"       "indices of the ordering of all variables VI mean"
## 6 "mean.perf"         "mean OOB rate for RF build on all variables"     
## 
##  For more information about VSURF outputs see the VSURF help page
## 
##  VSURF computation time: 36.2 mins 
## 
##  VSURF selected: 
##  26 variables at thresholding step (in 7.4 secs)
##  17 variables at interpretation step (in 23.4 mins)
##  16 variables at prediction step (in 12.6 mins)
##  [1]  9 19 30 16 29 26 31 20 18 23 25 24 27 17 12 28

Data Partition into Training and Test Sets

Next we partition the student evaluation data set by training and test sets. We will retain the terminology of test set to refer to the validation data set used for model performance assessment outside of the cross validation process.

set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]

Drop Near Zero Variance Predictors

We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.

No Near Zero Variance Predictors in Training

freqRatio percentUnique zeroVar nzv
id 1.00 100.00 FALSE FALSE
PH 1.05 2.48 FALSE FALSE
CarbVolume 1.03 4.72 FALSE FALSE
FillOunces 1.18 4.38 FALSE FALSE
PCVolume 1.19 21.22 FALSE FALSE
CarbPressure 1.10 5.06 FALSE FALSE
CarbTemp 1.04 5.99 FALSE FALSE
PSC 1.14 6.23 FALSE FALSE
PSCFill 1.11 1.56 FALSE FALSE
PSCCO2 1.08 0.63 FALSE FALSE
MnfFlow 1.08 21.75 FALSE FALSE
CarbPressure1 1.02 6.37 FALSE FALSE
FillPressure 1.78 5.06 FALSE FALSE
HydPressure1 31.86 11.58 FALSE FALSE
HydPressure2 7.63 9.39 FALSE FALSE
HydPressure3 12.71 8.91 FALSE FALSE
HydPressure4 1.02 1.90 FALSE FALSE
FillerLevel 1.05 12.90 FALSE FALSE
FillerSpeed 1.00 10.41 FALSE FALSE
Temperature 1.05 2.63 FALSE FALSE
Usagecont 1.12 22.43 FALSE FALSE
CarbFlow 1.33 23.21 FALSE FALSE
Density 1.19 3.65 FALSE FALSE
MFR 1.04 25.94 FALSE FALSE
Balling 1.24 9.34 FALSE FALSE
PressureVacuum 1.43 0.78 FALSE FALSE
OxygenFiller 1.32 15.67 FALSE FALSE
BowlSetpoint 2.80 0.49 FALSE FALSE
PressureSetpoint 1.33 0.39 FALSE FALSE
AirPressurer 1.03 1.41 FALSE FALSE
AlchRel 1.10 2.53 FALSE FALSE
CarbRel 1.02 1.95 FALSE FALSE
BallingLvl 1.23 3.89 FALSE FALSE
BrandCode 2.01 0.19 FALSE FALSE

Outliers and Zeros

Next we identify and fix zeros and outliers in the predictors before imputing missing values.
The reason is that KNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tells us how many standard deviations is the minimum and maximum.

\[ \text{zscore_min}(X_i) = \frac{ min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore_max}(X_i) = \frac{ max(X_i) - \mu(X_i) }{\sigma(X_i )}\]

Predictors with Outliers by Z-Score

skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.hist zscore_min zscore_max
MFR 156 0.92 704.15 74.72 ▁▁▁▂▇ -9.00 2.20
OxygenFiller 7 1.00 0.05 0.04 ▇▁▁▁▁ -0.97 7.88
Temperature 8 1.00 65.95 1.36 ▇▃▁▁▁ -1.73 7.55
CarbRel 6 1.00 5.44 0.13 ▁▇▇▂▁ -3.70 4.84
AirPressurer 0 1.00 142.83 1.20 ▇▇▁▁▁ -1.52 4.48
PSCCO2 29 0.99 0.06 0.04 ▇▅▂▁▁ -1.31 4.31
FillPressure 15 0.99 47.91 3.14 ▁▁▇▅▁ -4.23 3.72

Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors exist in the distribution of the predictors with outlier values.
Isolated points exist but lie within a reasonable proximity to neighboring points in the scatter plot. Therefore no data corrections for outliers will be applied to the training data.

Handle Special Cases

The code below will transform the special case situations of data pre-processing identified earlier:

These rules are applied consistently to be the training, validation and test data sets prior to subsequent pre processing.

KNN Imputation

We apply the caret pre processing function to that training, validation and test data sets.

# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected.  They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)
Data summary
Name sdata_train_pp
Number of rows 2055
Number of columns 34
_______________________
Column type frequency:
character 1
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1297.57 751.06 1.00 644.50 1296.00 1961.00 2568.00 ▇▇▇▇▇
PH 0 1 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▁▁
CarbVolume 0 1 0.00 1.00 -3.42 -0.71 -0.26 0.86 2.90 ▁▃▇▅▁
FillOunces 0 1 0.00 0.99 -3.88 -0.62 -0.01 0.60 3.91 ▁▂▇▂▁
PCVolume 0 1 0.01 1.00 -3.72 -0.60 -0.07 0.59 3.01 ▁▂▇▃▁
CarbPressure 0 1 0.01 1.00 -3.11 -0.72 0.04 0.70 2.88 ▁▃▇▅▁
CarbTemp 0 1 0.01 1.00 -3.47 -0.65 0.01 0.69 2.90 ▁▂▇▅▁
PSC 0 1 0.00 0.99 -2.75 -0.65 0.01 0.66 2.83 ▁▅▇▅▁
PSCFill 0 1 0.00 1.00 -1.66 -0.81 -0.13 0.54 3.60 ▆▇▃▁▁
PSCCO2 0 1 0.00 0.99 -1.31 -0.84 -0.37 0.33 4.31 ▇▅▂▁▁
MnfFlow 0 1 0.00 1.00 -1.05 -1.05 0.50 0.97 1.58 ▇▁▁▆▂
CarbPressure1 0 1 -0.01 1.00 -3.63 -0.77 0.13 0.60 3.77 ▁▃▇▂▁
FillPressure 0 1 0.00 1.00 -5.38 -0.58 -0.45 0.70 3.23 ▁▁▇▆▁
HydPressure1 0 1 0.00 1.00 -1.07 -1.01 -0.08 0.65 3.66 ▇▅▂▁▁
HydPressure2 0 1 0.00 1.00 -1.29 -1.29 0.47 0.83 2.34 ▇▂▇▅▁
HydPressure3 0 1 0.00 1.00 -1.37 -1.29 0.44 0.80 1.84 ▇▁▃▇▁
HydPressure4 0 1 0.01 1.00 -3.40 -0.76 0.07 0.52 2.82 ▁▅▇▇▁
FillerLevel 0 1 0.00 1.00 -2.85 -0.72 0.54 0.68 4.30 ▁▅▇▁▁
FillerSpeed 0 1 -0.05 1.05 -3.44 0.20 0.41 0.44 0.51 ▁▁▁▁▇
Temperature 0 1 0.00 1.00 -1.91 -0.57 -0.25 0.38 6.50 ▇▆▁▁▁
Usagecont 0 1 0.00 1.00 -2.46 -0.92 0.20 0.95 1.83 ▁▇▅▇▆
CarbFlow 0 1 0.00 1.00 -2.02 -1.33 0.52 0.69 2.99 ▃▁▇▁▁
Density 0 1 0.00 1.00 -4.85 -0.68 -0.41 1.18 1.71 ▁▁▁▇▅
MFR 0 1 0.00 1.00 -6.70 -0.02 0.29 0.41 3.35 ▁▁▁▇▁
Balling 0 1 0.00 1.00 -6.22 -0.74 -0.50 1.19 1.68 ▁▁▁▇▅
PressureVacuum 0 1 0.00 1.00 -2.42 -0.67 -0.31 0.39 2.84 ▂▇▅▃▂
OxygenFiller 0 1 0.00 1.00 -1.98 -0.35 0.03 0.62 3.21 ▃▇▆▂▁
BowlSetpoint 0 1 0.00 1.00 -2.39 -0.73 0.70 0.70 2.40 ▁▃▃▇▁
PressureSetpoint 0 1 0.00 1.00 -1.94 -0.77 -0.77 1.16 1.96 ▁▇▁▆▁
AirPressurer 0 1 0.00 1.00 -1.58 -0.53 -0.18 0.16 4.37 ▅▇▁▁▁
AlchRel 0 1 0.00 1.00 -5.09 -0.72 -0.62 0.79 2.76 ▁▁▇▂▃
CarbRel 0 1 0.00 1.00 -4.28 -0.75 -0.26 0.97 4.21 ▁▂▇▅▁
BallingLvl 0 1 0.00 1.00 -2.36 -0.77 -0.66 1.26 1.86 ▁▇▂▁▆
Data summary
Name sdata_test_pp
Number of rows 512
Number of columns 34
_______________________
Column type frequency:
character 1
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 1237.47 705.94 5.00 638.00 1248.50 1803.75 2571.00 ▇▇▇▇▆
PH 0 1 8.55 0.17 7.98 8.44 8.54 8.68 8.94 ▁▃▇▇▃
CarbVolume 0 1 0.01 1.01 -3.12 -0.71 -0.19 0.80 2.52 ▁▃▇▅▂
FillOunces 0 1 0.03 1.01 -3.20 -0.55 -0.01 0.60 3.99 ▁▅▇▂▁
PCVolume 0 1 -0.01 0.93 -3.47 -0.55 -0.04 0.58 2.79 ▁▂▇▅▁
CarbPressure 0 1 0.00 0.99 -3.54 -0.72 -0.02 0.71 2.74 ▁▂▇▇▁
CarbTemp 0 1 -0.02 1.00 -3.47 -0.65 -0.09 0.64 2.90 ▁▂▇▅▁
PSC 0 1 -0.05 1.06 -2.53 -0.81 -0.08 0.67 2.76 ▂▇▇▅▂
PSCFill 0 1 -0.02 0.99 -1.66 -0.81 -0.13 0.54 3.43 ▆▇▃▂▁
PSCCO2 0 1 0.07 1.03 -1.31 -0.84 -0.37 0.56 4.31 ▇▅▂▁▁
MnfFlow 0 1 -0.06 0.98 -1.05 -1.05 -0.22 0.92 1.69 ▇▁▁▇▁
CarbPressure1 0 1 -0.02 1.06 -3.46 -0.90 0.09 0.65 3.34 ▁▅▇▅▁
FillPressure 0 1 0.02 1.04 -3.82 -0.58 -0.45 0.70 3.41 ▁▁▇▅▁
HydPressure1 0 1 -0.03 1.01 -1.07 -1.01 -0.12 0.57 3.21 ▇▅▂▁▁
HydPressure2 0 1 -0.04 1.00 -1.29 -1.29 0.42 0.81 2.08 ▇▁▆▆▁
HydPressure3 0 1 -0.05 1.00 -1.37 -1.29 0.40 0.79 1.81 ▇▁▃▇▁
HydPressure4 0 1 0.07 0.99 -2.16 -0.58 0.23 0.52 2.82 ▂▃▇▂▁
FillerLevel 0 1 -0.07 1.05 -2.48 -1.22 0.57 0.69 3.08 ▂▃▇▁▁
FillerSpeed 0 1 -0.16 1.18 -3.43 0.05 0.35 0.44 0.48 ▁▁▁▁▇
Temperature 0 1 0.06 1.05 -1.91 -0.57 -0.10 0.38 6.30 ▆▇▁▁▁
Usagecont 0 1 -0.05 1.01 -2.54 -1.04 0.19 0.94 1.78 ▁▇▅▇▇
CarbFlow 0 1 0.08 1.00 -2.02 -1.25 0.54 0.74 1.43 ▃▁▁▇▅
Density 0 1 0.02 1.00 -4.14 -0.68 -0.41 1.18 1.68 ▁▁▆▇▇
MFR 0 1 -0.01 0.97 -6.40 -0.04 0.29 0.39 2.91 ▁▁▁▇▁
Balling 0 1 -0.01 1.01 -3.70 -0.82 -0.50 1.19 1.63 ▁▁▇▂▆
PressureVacuum 0 1 0.04 1.00 -2.42 -0.67 0.04 0.39 2.84 ▂▇▆▃▂
OxygenFiller 0 1 0.09 0.98 -1.98 -0.27 0.08 0.69 3.21 ▃▇▇▂▁
BowlSetpoint 0 1 -0.09 1.08 -2.39 -1.35 0.70 0.70 2.40 ▂▃▂▇▁
PressureSetpoint 0 1 0.02 0.98 -1.94 -0.77 -0.77 1.16 1.28 ▁▇▁▁▆
AirPressurer 0 1 0.04 1.05 -1.76 -0.53 -0.18 0.16 3.91 ▂▇▁▁▁
AlchRel 0 1 0.00 0.99 -1.08 -0.72 -0.67 0.68 2.74 ▇▁▂▃▁
CarbRel 0 1 -0.01 1.00 -3.67 -0.75 -0.26 0.83 3.02 ▁▂▇▅▁
BallingLvl 0 1 0.01 1.01 -2.36 -0.77 -0.68 1.26 1.65 ▁▆▇▁▇
Data summary
Name edata_test_pp
Number of rows 267
Number of columns 34
_______________________
Column type frequency:
character 1
logical 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BrandCode 0 1 1 1 0 5 0

Variable type: logical

skim_variable n_missing complete_rate mean count
PH 267 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 134.00 77.22 1.00 67.50 134.00 200.50 267.00 ▇▇▇▇▇
CarbVolume 0 1 -0.02 1.04 -2.23 -0.81 -0.26 0.89 2.63 ▂▇▅▅▁
FillOunces 0 1 -0.04 0.86 -2.60 -0.62 -0.09 0.45 2.60 ▁▅▇▃▁
PCVolume 0 1 0.01 1.04 -3.28 -0.70 0.01 0.75 2.82 ▁▃▇▇▁
CarbPressure 0 1 0.01 1.08 -2.42 -0.80 -0.02 0.70 2.46 ▂▆▇▅▂
CarbTemp 0 1 0.03 1.06 -3.04 -0.65 -0.04 0.71 2.90 ▁▃▇▃▁
PSC 0 1 -0.01 1.08 -2.53 -0.75 0.01 0.66 2.55 ▂▅▇▃▂
PSCFill 0 1 -0.05 0.96 -1.49 -0.73 -0.13 0.46 3.60 ▆▇▃▁▁
PSCCO2 0 1 -0.12 0.89 -1.31 -0.84 -0.37 0.09 4.31 ▇▃▂▁▁
MnfFlow 0 1 -0.04 0.98 -1.05 -1.05 -0.22 0.96 1.62 ▇▁▁▆▂
CarbPressure1 0 1 0.09 0.94 -2.05 -0.53 0.17 0.60 2.87 ▂▃▇▂▁
FillPressure 0 1 0.06 1.10 -3.82 -0.58 0.01 0.76 3.37 ▁▁▇▆▁
HydPressure1 0 1 -0.04 1.09 -5.04 -1.01 -0.17 0.63 3.02 ▁▁▇▆▂
HydPressure2 0 1 -0.07 1.05 -4.34 -1.29 0.35 0.83 2.46 ▁▁▆▇▁
HydPressure3 0 1 -0.07 1.04 -4.42 -1.29 0.44 0.77 1.79 ▁▁▆▃▇
HydPressure4 0 1 0.13 1.05 -2.63 -0.41 0.23 0.66 2.73 ▁▂▇▃▂
FillerLevel 0 1 0.06 1.01 -2.32 -0.66 0.56 0.69 3.51 ▃▃▇▁▁
FillerSpeed 0 1 -0.23 1.25 -3.43 0.04 0.28 0.44 0.48 ▁▁▁▁▇
Temperature 0 1 0.21 1.20 -1.74 -0.40 -0.09 0.53 6.09 ▇▇▂▁▁
Usagecont 0 1 -0.05 1.01 -2.37 -1.02 0.06 0.94 1.29 ▁▅▃▂▇
CarbFlow 0 1 -0.03 1.07 -2.02 -1.36 0.53 0.73 1.46 ▅▁▁▇▃
Density 0 1 -0.02 1.14 -9.22 -0.64 -0.41 1.14 1.58 ▁▁▁▃▇
MFR 0 1 -0.05 1.16 -6.71 -0.01 0.29 0.41 1.50 ▁▁▁▁▇
Balling 0 1 0.00 1.00 -1.98 -0.74 -0.50 1.16 1.54 ▁▇▃▁▇
PressureVacuum 0 1 0.08 1.02 -2.07 -0.67 0.04 0.74 2.84 ▁▇▆▃▁
OxygenFiller 0 1 0.02 0.99 -1.98 -0.41 0.05 0.55 3.20 ▅▇▇▂▁
BowlSetpoint 0 1 0.00 0.99 -2.39 -0.73 0.70 0.70 1.52 ▂▂▃▇▁
PressureSetpoint 0 1 0.06 1.00 -1.94 -0.77 -0.77 1.16 1.96 ▁▇▁▆▁
AirPressurer 0 1 0.01 1.03 -1.40 -0.53 -0.18 -0.01 3.60 ▅▇▁▁▁
AlchRel 0 1 0.01 0.98 -1.08 -0.72 -0.62 0.68 1.74 ▇▁▁▁▃
CarbRel 0 1 0.02 0.99 -2.14 -0.75 -0.26 0.97 2.24 ▁▇▃▃▃
BallingLvl 0 1 0.00 1.02 -2.36 -0.77 -0.66 1.19 1.58 ▁▃▇▁▇

Data visualization

First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of incomplete cells in the dataset, we prefer this dropping observations with any incomplete columns.

The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot package. This was the most efficient way to gain insight on the related clusters of variables.

M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

We compare the correlations to the cleaned pre-processed training data.

Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

Exporting the Post-Processed Data Sets

Our last step of preprocessing is to export the data sets. This need only but done once.

readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )

The above data files have the following properties and relationships to the original raw files:

Pre-Processed Data Files

filename primary_key response numeric_columns num_rows file_source
sdata_train_pp.csv id - consistent with sdata_test_pp PH - unchanged BoxCox, center, scaled, imputed 2055 StudentData.xlsx
sdata_test_pp.csv id - consistent with sdata_train_pp PH - unchanged BoxCox, center, scaled, imputed 512 StudentData.xlsx
edata_test_pp.csv id - standalone PH - unchanged BoxCox, center, scaled, imputed 267 StudentEvaluation.xlsx

An important note about the id column. The id in the file edata_test_pp.csv is the row number of the same observation in StudentEvaluation.xlsx. But the same is not true for sdata_train_pp.csv and sdata_test_pp.csv because they are sampled from the source file StudentData.xlsx. The 10th row of sdata_test_pp.csv is not necessarily the 10th row of StudentData.xlsx rather one should use the id field as the corresponding row number of the original file.

Importing the Pre-Processed Data

We load the pre-processed datasets for training and testing in this section. We load the csv file for the training data set below and display the first few column characteristics from the file.

sdata_train_pp = read_csv("sdata_train_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_test_pp  = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_trainX_pp  = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp   = sdata_test_pp %>% select(-PH , -id)
tuningFull = TRUE   # Set to TRUE if running the lengthy tuning process for all models

Model Building

Gradient Boosted Trees

We consider the same dataset and variable importance problem using gradient boosted trees. The gbm library and variable importance measures are run within caret training procedure. The gradient boosting approach relies on a weak learner in an additive manner that minimizes the given loss function.

set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )
## Stochastic Gradient Boosting 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.02       3                   250     0.1434366  0.3445322  0.1130661
##   0.02       3                   500     0.1399034  0.3781615  0.1093662
##   0.02       3                   750     0.1381810  0.3948955  0.1075668
##   0.02       3                  1000     0.1371433  0.4053153  0.1065292
##   0.02       7                   250     0.1375662  0.3998266  0.1072462
##   0.02       7                   500     0.1339808  0.4329093  0.1037283
##   0.02       7                   750     0.1326817  0.4453548  0.1025127
##   0.02       7                  1000     0.1320709  0.4515327  0.1019375
##   0.05       3                   250     0.1391827  0.3846115  0.1085754
##   0.05       3                   500     0.1370326  0.4073776  0.1063487
##   0.05       3                   750     0.1364019  0.4163811  0.1056864
##   0.05       3                  1000     0.1360931  0.4213869  0.1053766
##   0.05       7                   250     0.1337931  0.4330732  0.1032574
##   0.05       7                   500     0.1324935  0.4472241  0.1018857
##   0.05       7                   750     0.1319768  0.4531566  0.1013501
##   0.05       7                  1000     0.1317495  0.4562800  0.1010483
##   0.10       3                   250     0.1379832  0.3995082  0.1072565
##   0.10       3                   500     0.1371777  0.4135910  0.1063966
##   0.10       3                   750     0.1368283  0.4201240  0.1059677
##   0.10       3                  1000     0.1366484  0.4244918  0.1056588
##   0.10       7                   250     0.1343586  0.4341016  0.1034324
##   0.10       7                   500     0.1335822  0.4437819  0.1027288
##   0.10       7                   750     0.1332925  0.4472134  0.1023291
##   0.10       7                  1000     0.1332832  0.4482130  0.1022717
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  7, shrinkage = 0.05 and n.minobsinnode = 15.

As the model output indicates the greater tree depth outperforms the lower counts and the shrinkage value of 0.05 produces the best RMSE value. The shrinkage, or learning rate, typically works better with smaller values, yet require longer processing time.

## Saving 7 x 5 in image

The plot of the tuned model clearly shows the better performance with the increased max tree depth but also indicates the lower shrinkage value doesn’t guarantee a lower RMSE. For the max tree depth of 3 and 7, the lowest shrinkage value does not produce the lowest RMSE for that tree depth.

## Saving 7 x 5 in image

The variable importance plot shows the 100% importance of the MnfFlow followed by BrandCode as the only other variable with an importance above 50%. Usagecont, OxygenFiller, AlchRel, and Temperature round out the top 6 variables based on importance with scores greater than 25%.

Overall, the final Gradient Boosted model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.

Cubist

Now we train a Cubist tree using the best selectionFunction on RMSE. The Cubist library and variable importance measures are run within caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contains linear regression models. The models are based on predictions defined in the previous nodes of the tree.

set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )
## Cubist 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.10326126  0.6469501  0.07521436
##    10         1          0.10698108  0.6394174  0.07408043
##    10         5          0.09661241  0.6865169  0.06923965
##    10         9          0.09715807  0.6838488  0.06954861
##    50         0          0.10213076  0.6574062  0.07362497
##    50         1          0.10583349  0.6449193  0.07208654
##    50         5          0.09539318  0.6943587  0.06757449
##    50         9          0.09598916  0.6917813  0.06811918
##   100         0          0.10218813  0.6576706  0.07366847
##   100         1          0.10587327  0.6449503  0.07201655
##   100         5          0.09547621  0.6941947  0.06752479
##   100         9          0.09606335  0.6917983  0.06810842
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.

The Cubist model results provide a strong RMSE value and \(R^2\) value with generally a higher number of committees and neighbors.

The line plot confirms the better model per RMSE with neighbors at 5 and committees set to 50.

## Saving 7 x 5 in image

The variable importance plot identifies MnfFlow as the most valuable variable followed by Balling, BallingLvl, AlchRel, and PressureVacuum to round out the top 5 variables. Overall, 7 variables resulted in over 50% importance indicating those variables are contained in over 50% of the rules defining the Cubist model.

## Saving 7 x 5 in image

The final Cubist model confirms the high usage rate of the variable MnfFlow.

Cubist Predictor Usage

Conditions Model Variable
82 61 MnfFlow
18 81 Balling
22 71 BallingLvl
19 73 AlchRel
28 60 PressureVacuum
19 63 OxygenFiller
14 66 Density
21 51 BowlSetpoint
8 61 HydPressure3
7 62 Temperature
9 59 AirPressurer
18 47 CarbFlow
6 58 CarbRel
3 54 HydPressure2
3 52 CarbPressure1
11 40 Usagecont
16 31 FillerSpeed
6 39 HydPressure1
41 0 BrandCode
3 33 FillerLevel
0 27 CarbPressure
3 20 CarbVolume
2 21 PCVolume
0 20 CarbTemp
0 20 PressureSetpoint
0 17 FillPressure
0 16 HydPressure4
0 16 PSCFill
0 6 MFR
0 5 FillOunces
0 4 PSC
0 3 PSCCO2

The splits plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow along with the use of greater than or less than inequalities for MnfFlow.

## quartz_off_screen 
##                 2

The dotplot allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule.
When a predictor has more dots, it is used in more models which suggests the variable is influential. The color density is proportional to the variable importance. The sign of the coefficient predicts the marginal effect of the predictor on the response PH. However, since Cubist uses many models, the concept of the sign of a predictor coefficient can be unclear.
The x-axis value of the dot in the dotplot indicates the true coefficient in the original units unlike the above plot which normalizes the x-axis.

## quartz_off_screen 
##                 2

The boxplot below which we devised constructs a distribution of coefficients for each predictor. We interpret the sign of the predictor \(X\) marginal effect to be the sign of the median of the distribution of coefficients pf \(X\) over those models which include \(X\). The predictors have been sorted by their median.

The plot provides clarity regarding the impact of a given variable as the median is a good estimate of the sign of the variable coefficient. The AlchRel variable box indicates a positive relationship with the dependent variable PH. The most important variable MnfFlow box results in a negative coefficient, and thus an inverse effect on PH. For all the rules in the final Cubist model, the understanding of the coefficient values for each variable provides a clear relationship to the dependent variable despite the obtuse nature of the Cubist model construction.

## Saving 7 x 5 in image

Overall, the final Cubist model results in an RMSE under 0.1 and a \(R^2\) just below of 0.7.

CART

Using the rpart library and its training methods, we consider the best CART based on the RMSE metric. The CART approach follows a classification and regression tree methodology.

set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )
## CART 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##    1        0.1540934  0.2150028  0.1207013
##    2        0.1485972  0.2697600  0.1163125
##    3        0.1439126  0.3150026  0.1130389
##    4        0.1418976  0.3346403  0.1106321
##    5        0.1408007  0.3455111  0.1095038
##    6        0.1398046  0.3557469  0.1085050
##    7        0.1389994  0.3635874  0.1074293
##    8        0.1381955  0.3717156  0.1065335
##    9        0.1373228  0.3802125  0.1057161
##   10        0.1362384  0.3899605  0.1046780
##   11        0.1353910  0.3979319  0.1038336
##   12        0.1345705  0.4055253  0.1028965
##   13        0.1340562  0.4106769  0.1022712
##   14        0.1335638  0.4154690  0.1017284
##   15        0.1330891  0.4202569  0.1012114
##   16        0.1327996  0.4230057  0.1008957
##   17        0.1327006  0.4239743  0.1008136
##   18        0.1326532  0.4245372  0.1007342
##   19        0.1326119  0.4250877  0.1006977
##   20        0.1326119  0.4250877  0.1006977
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.

The model output shows increasing tree depth leads to improved RMSE and \(R^2\) values.

The below plot displays the improving RMSE value with the increasing tree depth with an optimal tree depth of 19.

## Saving 7 x 5 in image

Once again, the variable MnfFlow is the most important variable with an importance rating of 100%. Six variables achieve an importance score of greater than 50% with BrandCode, Temperature, and CarbRel each achieving greater than 75% importance.

## Saving 7 x 5 in image

The CART tree plot of the final model reaffirms the importance of the variable MnfFlow as the initial node splits on MnfFlow. One of the second level nodes is BrandCode, another highly important variable to the CART model. Eight ff the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different model approaches.

Overall, the final CART model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.

MARS

Now we train a MARS model based on the implementation of earth from the caret library. The MARS approach is a non-linear regression model that does not rely on the original variables but instead on derived features. The tuned MARS model defines the derived features as linear combinations of the initial variables.

## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922
## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
##             nprune=27)
## 
##                                                         coefficients
## (Intercept)                                                8.3958776
## h(-0.215889-MnfFlow)                                       0.1780017
## h(HydPressure3- -1.29159)                                  0.0391037
## h(-0.728732-Temperature)                                  -0.0805473
## h(Temperature- -0.728732)                                 -0.0477016
## h(-0.658787-Balling)                                       0.3588441
## h(-0.832915-OxygenFiller)                                 -0.0445311
## h(OxygenFiller- -0.832915)                                -0.0417267
## h(BowlSetpoint- -1.34947)                                  0.0724302
## h(0.640105-AlchRel)                                        0.0395901
## h(AlchRel-0.640105)                                        0.1293063
## h(1.05286-MnfFlow) * BrandCodeC                           -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC                           -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629)         -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum)         -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882)           -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer)            0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel)               -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947)      -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling)             -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787)             0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787)             0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947)         0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787)            -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947)           -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel)           0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718)    0.0305074
## 
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922

The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.

## Saving 7 x 5 in image

Once again, the variable MnfFlow achieves an importance score of 100%. Only 3 variables reach the score of 50%, the aforementioned MnfFlow, BrandCodeC, and AlchRel.

The variable MnfFlow is influential because the variable effects many other variables in the final model. Due to the numerous linear combinations including MnfFlow, the relationship to PH is difficult to ascertain. The MnfFlow variable requires more investigation into the variable interactions to determine the impact to PH for the MARS model.

## Saving 7 x 5 in image

Overall, the MARS model reaffirms the importance of the variable MnfFlow and also results in an RMSE and \(R^2\) in line with the previous models for Gradient Boosted Trees and CART.

Linear Regression

We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51932 -0.07897  0.00917  0.08626  0.44987 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.586130   0.010428 823.364  < 2e-16 ***
## CarbVolume       -0.005488   0.008389  -0.654 0.513047    
## FillOunces       -0.004105   0.003163  -1.298 0.194451    
## PCVolume         -0.006448   0.003678  -1.753 0.079740 .  
## CarbPressure     -0.003088   0.011767  -0.262 0.793033    
## CarbTemp          0.008475   0.010712   0.791 0.428936    
## PSC              -0.003168   0.003147  -1.007 0.314160    
## PSCFill          -0.005762   0.003066  -1.879 0.060361 .  
## PSCCO2           -0.006170   0.003020  -2.043 0.041221 *  
## MnfFlow          -0.089502   0.006520 -13.728  < 2e-16 ***
## CarbPressure1     0.030465   0.003680   8.278 2.25e-16 ***
## FillPressure      0.005601   0.004501   1.244 0.213528    
## HydPressure1      0.004482   0.005177   0.866 0.386658    
## HydPressure2     -0.027126   0.009963  -2.723 0.006533 ** 
## HydPressure3      0.059090   0.010506   5.624 2.12e-08 ***
## HydPressure4      0.001523   0.004624   0.329 0.741982    
## FillerLevel      -0.021768   0.008733  -2.493 0.012760 *  
## FillerSpeed       0.003863   0.004923   0.785 0.432809    
## Temperature      -0.019234   0.003564  -5.397 7.59e-08 ***
## Usagecont        -0.020299   0.003911  -5.191 2.31e-07 ***
## CarbFlow          0.012499   0.004515   2.768 0.005685 ** 
## Density          -0.018774   0.009023  -2.081 0.037598 *  
## MFR               0.002030   0.003898   0.521 0.602651    
## Balling          -0.091337   0.014856  -6.148 9.43e-10 ***
## PressureVacuum   -0.016446   0.004588  -3.585 0.000345 ***
## OxygenFiller     -0.015182   0.004217  -3.600 0.000326 ***
## BowlSetpoint      0.053190   0.009135   5.823 6.72e-09 ***
## PressureSetpoint -0.013892   0.004500  -3.087 0.002047 ** 
## AirPressurer     -0.002649   0.003204  -0.827 0.408420    
## AlchRel           0.034578   0.011814   2.927 0.003463 ** 
## CarbRel           0.005015   0.006593   0.761 0.447003    
## BallingLvl        0.097856   0.017857   5.480 4.79e-08 ***
## BrandCodeA       -0.095705   0.025430  -3.764 0.000172 ***
## BrandCodeC       -0.139671   0.010233 -13.649  < 2e-16 ***
## BrandCodeD       -0.043954   0.029625  -1.484 0.138054    
## BrandCodeE       -0.057782   0.015435  -3.744 0.000186 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.4123 
## F-statistic: 42.16 on 35 and 2019 DF,  p-value: < 2.2e-16

The most significant variables are MnfFlow, CarbPressure, HydPressure3, Temperature, Usagecont, Balling, PressureVacuum, OxygenFiller, BowlSetpoint, BallingLvl, and BrandCode. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median value of the variable coefficients from the top 10 of the Cubist model match the sign of the coefficient value in the linear regression model. This finding further supports the assessment of the Cubist model variable impacts on the dependent variable PH.

We summarize the plot of coefficients below.

## NULL

The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.

Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.

Model Selection

Based on the above results, we summarize their training and test RMSE and \(R^2\) below.

The joint results are tabulated below.

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

Model Results Comparison

modelName trainRMSE trainR2 testRMSE testR2
cubist 0.095 0.694 0.090 0.722
rpart2 0.133 0.425 0.125 0.474
lm 0.134 0.396 0.127 0.449
gbm 0.132 0.456 0.151 0.342
earth 0.125 0.486 0.153 0.279
## save_kable will have the best result with magick installed.

We also inspect the plot of the predicted versus the observed data for the 5 models. The scatterplots below confirm the Cubist model outperforms the others on both the training set in the upper row and also the test data on the lower rows. The distribution of the Cubist plot mostly closely aligns with the 45-degree line which corresponds to an accurate prediction.

Discussion

We conclude that the best model is Cubist but the multilinear regression model, MARS and CART models provide useful insight on the main predictors.

The top 5 variables that impact PH are:

Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Mostly likely Brand refers to a recipe for the final product. However, knowing the desired optimal PH for each Brand may help us control the most influential variables that control PH, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend looking at the top 10 most influential predictors. Most of the models agree substantially on which predictors to include in the top 10.

For example, the linear regression model coefficients agree with the sign and variable importance of the top 10 predictors of the Cubist model. Importance predictors of the OLS model have the highest statistical significance - consistent with variable importance measure of caret.

Final Predictions

## Cubist 
## 
## 2567 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.09786899  0.6814835  0.07063050
##    10         1          0.10347326  0.6605109  0.07157433
##    10         5          0.09205621  0.7148932  0.06565672
##    10         9          0.09275054  0.7104184  0.06629558
##    50         0          0.09733927  0.6877932  0.07034005
##    50         1          0.10284267  0.6632483  0.07070141
##    50         5          0.09105650  0.7209833  0.06455347
##    50         9          0.09179098  0.7169196  0.06532125
##   100         0          0.09682465  0.6925441  0.07016957
##   100         1          0.10246532  0.6656340  0.07043498
##   100         5          0.09042487  0.7248834  0.06419195
##   100         9          0.09121584  0.7206745  0.06503557
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.

RMSE of the full StudentData dataset: 0.0904249.

\(R^2\) of the full StudentData dataset: 0.7248834.

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)
# Conditional Code Evaluation is managed here.

suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rpart.plot))
suppressPackageStartupMessages(library(partykit))
suppressPackageStartupMessages(library(gbm))
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(Cubist)
library(rpart)
library(party)
library(writexl)
.code-bg {
  background-color: lightgray;
}

raw_StudentData <- read_excel("StudentData.xlsx")
dim(raw_StudentData)
skim(raw_StudentData) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentData) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)
# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)
missing_train_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary

missing_train_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells = sum( missing_train_summary$num_na)
num_cells = nrow(sdata_v1) * 32  # excludes the response variable PH since we dropped its values.

print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentEvaluation) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentEvaluation %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` )  %>% as_tibble() ->  edata_v1

dim(edata_v1)
# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)
missing_test_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary

missing_test_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells_test = sum( missing_test_summary$num_na)
num_cells_test = nrow(edata_v1) * 32  # excludes the response variable PH since we dropped its values.

print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " , 
             sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )


skim_sdata = skim(sdata_v1)
skim_edata = skim(edata_v1)

skim_sdata %>% select(  skim_variable, n_missing, complete_rate) %>% inner_join(skim_edata %>% select(skim_variable, n_missing, complete_rate), by = c("skim_variable") ) %>% filter(skim_variable != 'PH' ) -> skim_complete_rates 

skim_complete_rates %>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>%  ggplot(aes(x=training_complete, y = testing_complete)) + 
  geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
  labs(title = "Completion Rate by Predictor between Training and Test Sets")

# Feature plot for the numeric predictor variables against the result variable PH
cols <- sdata_v1 %>%
  select(-c('id', 'BrandCode', 'PH')) %>% colnames()

featurePlot(sdata_v1[,cols], 
            sdata_v1$PH, 
            plot="scatter",
            type = c("p", "smooth"),
            span = .5,
            layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
data_long <- sdata_v1[,cols] %>%
  pivot_longer(cols) %>%
  as.data.frame()

ggp2 <- ggplot(data_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ name, scales="free") +
  labs(title = "Density Plot of Predictor Variables")
ggp2
# http://www.sthda.com/english/wiki/normality-test-in-r
# From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
shap_test_res <- lapply(sdata_v1[,cols], shapiro.test)
# Result: None of the numeric variables are normally distributed.

# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
subset_vector  <- sapply(shap_test_res, function(x) x$p.value > .05)
results_subset <- shap_test_res[subset_vector]

length(results_subset)
ggplot(data = sdata_v1) +
  geom_bar(mapping = aes(x = `BrandCode`)) +
  labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
  geom_boxplot() +
  labs(title = "Boxplot of Brand Code")
sdata_v1_no_nas <- sdata_v1 %>% drop_na()

bev.vsurf <- VSURF(sdata_v1_no_nas[,cols], 
                   sdata_v1_no_nas$PH,
                   ntree = 10,
                   nfor.thres = 20,
                   nfor.interp = 10, nfor.pred = 10)
bev.vsurf
summary(bev.vsurf)
bev.vsurf$varselect.pred
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]

nearZeroVar(sdata_train, saveMetrics = TRUE) %>% 
  kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(sdata_train) %>% 
 mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
  arrange( desc(zscore_extreme)) %>%
  dplyr::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
  kable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
  kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
pl1 = sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)

plotlist= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plot_grid(plotlist=plotlist, nrow = 2 )


median_MFR = median( sdata_train$MFR  , na.rm = TRUE ) 

sdata_train %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_train

sdata_test  %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_test

edata_v1 %>%
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) ->  edata_v2



# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected.  They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)

skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )


readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )


file_df = tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
                  primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
                  response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
                  
                  numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
                  num_rows = c(2055, 512, 267 ) ,
                  file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)

file_df %>% kable(caption = "Pre-Processed Data Files") %>% kable_styling(bootstrap_options = c("hover", "striped"
                                                                                            ), 
                                                                          position = "left")


sdata_train_pp = read_csv("sdata_train_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_test_pp  = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_trainX_pp  = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp   = sdata_test_pp %>% select(-PH , -id)
tuningFull = TRUE   # Set to TRUE if running the lengthy tuning process for all models
set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )
gbm_plot <- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
ggsave("GradientBoostedTreesTurning.jpg", gbm_plot)
gbm_vi_plot <- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
ggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )
cubist_plot <- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
ggsave("CubistTuning.jpg", cubist_plot)
cubist_vi_plot <- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
ggsave("CubistVarImp.jpg", cubist_vi_plot)
cubistTune$finalModel$usage %>% arrange( desc(Model+ Conditions)) %>% 
  kable(caption = "Cubist Predictor Usage") %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_splitsplot <- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot
# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplot
dev.off()
cubist_dotplot <- dotplot( cubistTune$finalModel, 
         what = "coefs", 
         between = list(x = 0.2, y = 0.5) , 
         scales = list(x = list(relation = "free"),  
                       y = list(cex = 0.25)  ) )
cubist_dotplot
jpeg(file="CubistDotPlot.jpg")
cubist_splitsplot
dev.off()
cubistTune$finalModel$coefficients %>% 
  rownames_to_column(var="id") %>% 
  select(-committee, -rule) %>% 
  pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>% 
  filter( !is.na(coef)) %>% 
  filter( predictor != '(Intercept)' )  %>%
  filter(abs(coef) < 20 ) -> coef_piv

cubist_boxplot <- coef_piv %>% 
  ggplot() + 
  geom_boxplot(
    aes(x=reorder(predictor, coef, FUN = median, na.rm = TRUE ),  # order the boxplots by median
        y = coef ), outlier.shape = NA ) +  # strip out the outliers
  coord_flip(ylim=c(-.5,.5)) +
  labs(title = "Distribution of Coefficients of Predictors", 
       subtitle = "from Cubist Committee/Rules",
       x = "Predictors" ,
       y = "Coefficient in Linear Model"
       )

cubist_boxplot
ggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )
cart_tuning_plot <- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
ggsave("CartTuningPlot.jpg", cart_tuning_plot)
cart_vi_plot <- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
ggsave("CartVarImp.jpg", cart_vi_plot)
rpart.plot(rpartTune$finalModel, digits = 3)

if( tuningFull == TRUE )
{
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                
                     selectionFunction = "best", 
                     verboseIter = FALSE)
} else {
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
    
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                     
                     selectionFunction = "best", 
                     verboseIter = FALSE)
}
    

set.seed(1000)

marsTuned = caret::train(x = sdata_trainX_pp , 
                         y = sdata_train_pp$PH, 
                         method = "earth" ,
                         metric = "RMSE" , 
                         tuneGrid = marsGrid ,
                         trControl = marsControl )
marsTuned$finalModel

summary(marsTuned)
mars_tuning_plot <- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
ggsave("MarsTuningPlot.jpg", mars_tuning_plot)
mars_vi_plot <- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
ggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)

train.control <- trainControl( method = "cv", number = 5)

linearTune = train( PH ~ . -id, 
                    data = sdata_train_pp, 
                    method = "lm", 
                    tuneGrid = data.frame(intercept = TRUE),
                    trControl = train.control)

summary(linearTune$finalModel)
par(mfrow=c(2,2))

lm_final_model_plot <- plot(linearTune$finalModel)
lm_final_model_plot
#ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

results_table <- model_stats %>% as_tibble() %>% 
  arrange( testRMSE) %>%
  kable( digits = 3, caption = "Model Results Comparison") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

results_table

save_kable(results_table, "ModelResultsTable.jpg")
predVals = extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )

# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
pred_lm =  predict(linearTune, newdata = sdata_test_pp)
pred_lm_train <- linearTune$finalModel$fitted.values

# Create placeholder data for the training values
model_l <- rep("LM", length = 2055)
dt_l <- rep("Training", length = 2055)
obj_l <- rep("Object9", length = 2055)

# Create dataframe of the training data of LM
lm_train_preds <- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Create placeholder data for the test values
model_l <- rep("LM", length = 512)
dt_l <- rep("Test", length = 512)
obj_l <- rep("Object10", length = 512)

# Create dataframe of the test data of LM
lm_test_preds <- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Combine all data for plotting
all_preds <- rbind(predVals, lm_train_preds, lm_test_preds)

# Ensure obs and preds are numerics
all_preds$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)

# Plot using Caret built-in reporting.
# --------------------------------------
obsVsPredsPlot <- plotObsVsPred(all_preds)
obsVsPredsPlot
#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
sdata_trainfull_pp <- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfullX_pp <- sdata_trainfull_pp %>% select(-PH , -id)

edata_test_pp = read_csv("edata_test_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

# Train the Cubist model
set.seed(1027)

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainfullX_pp, 
                          y = sdata_trainfull_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

# Make predictions on the provided dataset
final_pred_data <- as.numeric( predict( cubistTune, newdata=edata_test_pp ))

# Output prediction performance
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,1]
full_cubist_train_r2   <- getTrainPerf(cubistTune)[1,2]
# Output predictions to Excel file
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
raw_StudentEvaluation$PH <- final_pred_data
write_xlsx(raw_StudentEvaluation, "ANg_PTanofsky_Ph_Predictions.xlsx")