Data Wrangling Strategy

We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:

Add a primary key id to allow unique identification of all raw observations.
Drop observations with NA in the response variable PH.
Drop observations with a high number of missing values. In our case, observations that have more than 4 missing values.
Rename the variables to remove spaces.
Split the initial data set using stratified sampling based on PH value into a 80/20 train and validate data set. We recognize that a test data set exists but this is for external assessment of the project prediction accuracy.
Identify and drop near zero variance predictors in the training set. And apply the same predictor removals to the validation data set (regardless of their values in the latter).
Identify and fix zeros and outliers in the predictors in the training set as follows:

Identify outliers based on extreme Z scores (e.g. the z-score of the predictor min and max)
Decide if their values are plausible or implausible. Choose an outlier threshold.
Replace all such values by the column median among the training set population.

Handle special cases of variables:

Replace the missing BrandCode categorical value with a dummy BrandCode E
Impute MFR missing value with the training set median.

Apply the knn imputation strategy to Box-Cox transformed, scaled, centered observations for all training set points.

Along the way, we will assess the numerical impact of such changes to the dataset.

## [1] 2571   33

Let’s evaluate the data characteristics of the raw file after loading. We use the skimr library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
MFR	212	0.918	704.049	73.898	31.400	706.300	724.000	731.000	868.600	▁▁▁▂▇
Filler Speed	57	0.978	3687.199	770.820	998.000	3888.000	3982.000	3998.000	4030.000	▁▁▁▁▇
PC Volume	39	0.985	0.277	0.061	0.079	0.239	0.271	0.312	0.478	▁▃▇▂▁
PSC CO2	39	0.985	0.056	0.043	0.000	0.020	0.040	0.080	0.240	▇▅▂▁▁
Fill Ounces	38	0.985	23.975	0.088	23.633	23.920	23.973	24.027	24.320	▁▂▇▂▁
PSC	33	0.987	0.085	0.049	0.002	0.048	0.076	0.112	0.270	▆▇▃▁▁
Carb Pressure1	32	0.988	122.586	4.743	105.600	119.000	123.200	125.400	140.200	▁▃▇▂▁
Hyd Pressure4	30	0.988	96.289	13.123	52.000	86.000	96.000	102.000	142.000	▁▃▇▂▁
Carb Pressure	27	0.989	68.190	3.538	57.000	65.600	68.200	70.600	79.400	▁▅▇▃▁
Carb Temp	26	0.990	141.095	4.037	128.600	138.400	140.800	143.800	154.000	▁▅▇▃▁
PSC Fill	23	0.991	0.195	0.118	0.000	0.100	0.180	0.260	0.620	▆▇▃▁▁
Fill Pressure	22	0.991	47.922	3.178	34.600	46.000	46.400	50.000	60.400	▁▁▇▂▁
Filler Level	20	0.992	109.252	15.698	55.800	98.300	118.400	120.000	161.200	▁▃▅▇▁
Hyd Pressure2	15	0.994	20.961	16.386	0.000	0.000	28.600	34.600	59.400	▇▂▇▅▁
Hyd Pressure3	15	0.994	20.458	15.976	-1.200	0.000	27.600	33.400	50.000	▇▁▃▇▁
Temperature	14	0.995	65.968	1.383	63.600	65.200	65.600	66.400	76.200	▇▃▁▁▁
Oxygen Filler	12	0.995	0.047	0.047	0.002	0.022	0.033	0.060	0.400	▇▁▁▁▁
Pressure Setpoint	12	0.995	47.615	2.039	44.000	46.000	46.000	50.000	52.000	▁▇▁▆▁
Hyd Pressure1	11	0.996	12.438	12.433	-0.800	0.000	11.400	20.200	58.000	▇▅▂▁▁
Carb Volume	10	0.996	5.370	0.106	5.040	5.293	5.347	5.453	5.700	▁▆▇▅▁
Carb Rel	10	0.996	5.437	0.129	4.960	5.340	5.400	5.540	6.060	▁▇▇▂▁
Alch Rel	9	0.996	6.897	0.505	5.280	6.540	6.560	7.240	8.620	▁▇▂▃▁
Usage cont	5	0.998	20.993	2.978	12.080	18.360	21.790	23.755	25.900	▁▃▅▃▇
PH	4	0.998	8.546	0.173	7.880	8.440	8.540	8.680	9.360	▁▅▇▂▁
Mnf Flow	2	0.999	24.569	119.481	-100.200	-100.000	65.200	140.800	229.400	▇▁▁▇▂
Carb Flow	2	0.999	2468.354	1073.696	26.000	1144.000	3028.000	3186.000	5104.000	▂▅▆▇▁
Bowl Setpoint	2	0.999	109.327	15.303	70.000	100.000	120.000	120.000	140.000	▁▂▃▇▁
Density	1	1.000	1.174	0.378	0.240	0.900	0.980	1.620	1.920	▁▅▇▂▆
Balling	1	1.000	2.198	0.931	-0.170	1.496	1.648	3.292	4.012	▁▇▇▁▇
Balling Lvl	1	1.000	2.050	0.870	0.000	1.380	1.480	3.140	3.660	▁▇▂▁▆
Pressure Vacuum	0	1.000	-5.216	0.570	-6.600	-5.600	-5.400	-5.000	-3.600	▂▇▆▂▁
Air Pressurer	0	1.000	142.834	1.212	140.800	142.200	142.600	143.000	148.200	▅▇▁▁▁

Next, we display the one character variable Brand below and its skimr output below.

Character Variables in Beverage Data

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	120	0.953	1	1	0	4	0

After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH.

General observations about the skimr results:

The predictor MFR is missing the most data: about 8% (212) of its observations are missing. The percentage of missing observations is too high for automated imputation in our opinion.
The predictor Filler Speed is missing the next most data: about 2.2% (57) of its observations are missing. The percentage is low enough for us to consider imputation.
The overall data completeness rate is very high. We regard this dataset has being very high quality. 21 predictors have completeness rates over 99%. Except for MFR, Brand, all of the other predictors have data completeness rates over 97.8%. Therefore, except for Brand and MFR, an automated data imputation policy can be considered for those columns.
The response PH has 4 missing observations. We will decide to drop those 4 observations as we do not wish to impute the response.

Our first transformation below is to add a unique identifier id based on row number to obtain a primary key.
Then we drop the 4 observations which are missing PH values, rearrange the response variable PH and BrandCode for our convenience. Lastly, we also eliminate spaces from the predictor column names to avoid using quote in our code.

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)

## [1] 2567   34

Rows with Missing Data

Is there any observations with a high number of missing observations?

# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)

## [1] 529  28

We find there are 529 observations with NA observations. So 20.6% have NA values. We conclude that the NA values are sufficiently well distributed that removal of all observations with NA values is impractical.

In the marginal table below, we investigate if certain observations have a high incidence of NA values amongst their predictors. The answer shows that the incident of multiple

Number of Rows with NA

num_na	count
1	345
2	119
3	45
4	13
5	3
6	2
7	1
8	1

## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"

Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio suggesting independence among the occurrence of missing values.

Evaluation of Missing Data in the Test Set

## [1] 267  33

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
MFR	31	0.884	697.801	96.400	15.600	707.000	724.600	731.450	784.800	▁▁▁▁▇
Filler Speed	10	0.963	3581.385	911.187	1006.000	3812.000	3978.000	3996.000	4020.000	▁▁▁▁▇
Fill Ounces	6	0.978	23.970	0.076	23.747	23.920	23.967	24.013	24.200	▁▅▇▃▁
PSC	5	0.981	0.085	0.053	0.004	0.044	0.076	0.112	0.246	▆▇▃▂▁
PSC CO2	5	0.981	0.051	0.038	0.000	0.020	0.040	0.060	0.240	▇▃▂▁▁
PC Volume	4	0.985	0.278	0.063	0.099	0.233	0.275	0.322	0.464	▁▆▇▅▁
Carb Pressure1	4	0.985	123.040	4.420	113.000	120.200	123.400	125.500	136.000	▃▃▇▂▁
Hyd Pressure4	4	0.985	97.840	13.921	68.000	90.000	98.000	104.000	140.000	▅▆▇▂▁
PSC Fill	3	0.989	0.190	0.113	0.020	0.100	0.180	0.260	0.620	▇▇▃▁▁
Oxygen Filler	3	0.989	0.047	0.050	0.002	0.020	0.034	0.054	0.398	▇▁▁▁▁
Alch Rel	3	0.989	6.907	0.498	6.400	6.540	6.580	7.180	7.820	▇▁▂▁▃
Fill Pressure	2	0.993	48.141	3.440	37.800	46.000	47.800	50.200	60.200	▁▇▇▂▁
Filler Level	2	0.993	110.292	15.502	69.200	100.600	118.600	120.200	153.200	▂▃▇▇▁
Temperature	2	0.993	66.227	1.692	63.800	65.400	65.800	66.600	75.400	▇▅▁▁▁
Usage cont	2	0.993	20.897	2.999	12.900	18.120	21.440	23.740	24.600	▁▃▃▃▇
Pressure Setpoint	2	0.993	47.733	2.057	44.000	46.000	46.000	50.000	52.000	▁▇▁▆▁
Carb Rel	2	0.993	5.440	0.127	5.180	5.340	5.400	5.560	5.740	▂▇▂▃▂
Carb Volume	1	0.996	5.369	0.111	5.147	5.287	5.340	5.465	5.667	▂▇▃▅▁
Carb Temp	1	0.996	141.226	4.296	130.000	138.400	140.800	143.800	154.000	▁▆▇▃▁
Hyd Pressure2	1	0.996	20.113	17.214	-50.000	0.000	26.800	34.800	61.400	▁▁▆▇▁
Hyd Pressure3	1	0.996	19.609	16.557	-50.000	0.000	27.700	33.000	49.200	▁▁▆▃▇
Density	1	0.996	1.177	0.382	0.060	0.920	0.980	1.600	1.840	▁▁▇▁▅
Balling	1	0.996	2.203	0.925	0.902	1.498	1.648	3.242	3.788	▅▇▁▂▅
Pressure Vacuum	1	0.996	-5.174	0.582	-6.400	-5.600	-5.200	-4.800	-3.600	▁▇▆▃▁
Bowl Setpoint	1	0.996	109.624	15.017	70.000	100.000	120.000	120.000	130.000	▁▂▁▃▇
Air Pressurer	1	0.996	142.831	1.231	141.200	142.200	142.600	142.800	147.200	▅▇▁▁▁
Carb Pressure	0	1.000	68.255	3.857	60.200	65.300	68.000	70.600	77.600	▃▆▇▃▂
Mnf Flow	0	1.000	21.034	117.756	-100.200	-100.000	0.200	141.300	220.400	▇▁▁▆▂
Hyd Pressure1	0	1.000	12.010	13.531	-50.000	0.000	10.400	20.400	50.000	▁▁▇▆▂
Carb Flow	0	1.000	2408.644	1161.364	0.000	1083.000	3038.000	3215.000	3858.000	▂▃▁▆▇
Balling Lvl	0	1.000	2.051	0.883	0.000	1.380	1.480	3.080	3.420	▁▃▇▁▇

Character Variables in Beverage Data

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	8	0.97	1	1	0	4	0

## [1] 267  34

# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)

## [1] 67 28

Number of Rows with NA

num_na	count
1	50
2	11
3	3
4	1
8	1
14	1

## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"

We conclude that the distribution of missing values in the test set is:

the completeness rate of each predictors in the test and training sets are comparable (as illustrated by the scatter plot and linear regression line above)
shows 1.25% of cells are missing out - a very similar rate to the training set.
the distribution of missing values grouped by observation appears to be similar to the training set too. The frequency of multiple missing values appears independent and declines geometrically with the number of missing values.

Exploratory Data Analysis

Scatterplot of Predictor Variables

The scatterplots of the predictor variables against the dependent variable PH indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.

Density Plots

The density plots of each predictor variable show many predictor variables with non-normal distributions.

Normality Test

The Shapiro-Wilk test of normality is performed on all the numeric predictor variables and the resulting list confirms no p-value above 0.05 and thus no predictor variable follows a normal distribution per evaluation.

## [1] 0

Distribution by Brand Code

The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.

Boxplot of PH by Brand Code

The boxplot of PH values by Brand Code certainly captures overlap of the middle two quartiles, which seems reasonable given the PH range is rather narrow. Of note, the median values of the 4 labeled brands and the unlabeled group do appear distinct besides Brand A and the unlabeled.

VSURF

To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.

Mnf Flow
Usage cont
Carb Rel
Filler Level
Alch Rel
Bowl Setpoint
Balling Lvl
Carb Flow
Temperature
Oxygen Filler
Balling
Pressure Vacuum
Filler Speed
Hyd Pressure1

## Thresholding step
## Estimated computational time (on one core): 12.3 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |======================================================================| 100%
## Interpretation step (on 26 variables)
## Estimated computational time (on one core): between 799.5 sec. and  2356.1 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%
## Prediction step (on 17 variables)
## Maximum estimated computational time (on one core): 970.2 sec.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |======================================================                |  76%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |======================================================================| 100%

## ** VSURF results **
## The results object is a list of length around 20.
## Most interesting components are the following:
## 
##   name                description                                       
## 1 "$varselect.thres"  "variables selected after thesholding step"       
## 2 "$varselect.interp" "variables selected after interpretation step"    
## 3 "$varselect.pred"   "variables selected after prediction step"        
## 4 "$ord.imp$x"        "mean VI in decreasing order for all variables"   
## 5 "$ord.imp$ix"       "indices of the ordering of all variables VI mean"
## 6 "mean.perf"         "mean OOB rate for RF build on all variables"     
## 
##  For more information about VSURF outputs see the VSURF help page

## 
##  VSURF computation time: 36.2 mins 
## 
##  VSURF selected: 
##  26 variables at thresholding step (in 7.4 secs)
##  17 variables at interpretation step (in 23.4 mins)
##  16 variables at prediction step (in 12.6 mins)

##  [1]  9 19 30 16 29 26 31 20 18 23 25 24 27 17 12 28

Data Partition into Training and Test Sets

Next we partition the student evaluation data set by training and test sets. We will retain the terminology of test set to refer to the validation data set used for model performance assessment outside of the cross validation process.

set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]

Drop Near Zero Variance Predictors

We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.

No Near Zero Variance Predictors in Training

	freqRatio	percentUnique	zeroVar	nzv
id	1.00	100.00	FALSE	FALSE
PH	1.05	2.48	FALSE	FALSE
CarbVolume	1.03	4.72	FALSE	FALSE
FillOunces	1.18	4.38	FALSE	FALSE
PCVolume	1.19	21.22	FALSE	FALSE
CarbPressure	1.10	5.06	FALSE	FALSE
CarbTemp	1.04	5.99	FALSE	FALSE
PSC	1.14	6.23	FALSE	FALSE
PSCFill	1.11	1.56	FALSE	FALSE
PSCCO2	1.08	0.63	FALSE	FALSE
MnfFlow	1.08	21.75	FALSE	FALSE
CarbPressure1	1.02	6.37	FALSE	FALSE
FillPressure	1.78	5.06	FALSE	FALSE
HydPressure1	31.86	11.58	FALSE	FALSE
HydPressure2	7.63	9.39	FALSE	FALSE
HydPressure3	12.71	8.91	FALSE	FALSE
HydPressure4	1.02	1.90	FALSE	FALSE
FillerLevel	1.05	12.90	FALSE	FALSE
FillerSpeed	1.00	10.41	FALSE	FALSE
Temperature	1.05	2.63	FALSE	FALSE
Usagecont	1.12	22.43	FALSE	FALSE
CarbFlow	1.33	23.21	FALSE	FALSE
Density	1.19	3.65	FALSE	FALSE
MFR	1.04	25.94	FALSE	FALSE
Balling	1.24	9.34	FALSE	FALSE
PressureVacuum	1.43	0.78	FALSE	FALSE
OxygenFiller	1.32	15.67	FALSE	FALSE
BowlSetpoint	2.80	0.49	FALSE	FALSE
PressureSetpoint	1.33	0.39	FALSE	FALSE
AirPressurer	1.03	1.41	FALSE	FALSE
AlchRel	1.10	2.53	FALSE	FALSE
CarbRel	1.02	1.95	FALSE	FALSE
BallingLvl	1.23	3.89	FALSE	FALSE
BrandCode	2.01	0.19	FALSE	FALSE

Outliers and Zeros

Next we identify and fix zeros and outliers in the predictors before imputing missing values.
The reason is that KNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tells us how many standard deviations is the minimum and maximum.

\[ \text{zscore_min}(X_i) = \frac{ min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore_max}(X_i) = \frac{ max(X_i) - \mu(X_i) }{\sigma(X_i )}\]

Predictors with Outliers by Z-Score

skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.hist	zscore_min	zscore_max
MFR	156	0.92	704.15	74.72	▁▁▁▂▇	-9.00	2.20
OxygenFiller	7	1.00	0.05	0.04	▇▁▁▁▁	-0.97	7.88
Temperature	8	1.00	65.95	1.36	▇▃▁▁▁	-1.73	7.55
CarbRel	6	1.00	5.44	0.13	▁▇▇▂▁	-3.70	4.84
AirPressurer	0	1.00	142.83	1.20	▇▇▁▁▁	-1.52	4.48
PSCCO2	29	0.99	0.06	0.04	▇▅▂▁▁	-1.31	4.31
FillPressure	15	0.99	47.91	3.14	▁▁▇▅▁	-4.23	3.72

Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors exist in the distribution of the predictors with outlier values.
Isolated points exist but lie within a reasonable proximity to neighboring points in the scatter plot. Therefore no data corrections for outliers will be applied to the training data.

Handle Special Cases

The code below will transform the special case situations of data pre-processing identified earlier:

Replace the MFR missing values with its median value from the training set
Replace the missing BrandCode with a fictional brand code E.

These rules are applied consistently to be the training, validation and test data sets prior to subsequent pre processing.

KNN Imputation

We apply the caret pre processing function to that training, validation and test data sets.

# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected.  They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)

Data summary

Name	sdata_train_pp
Number of rows	2055
Number of columns	34
_______________________
Column type frequency:
character	1
numeric	33
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
BrandCode	0	1	1	1	0	5	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	1297.57	751.06	1.00	644.50	1296.00	1961.00	2568.00	▇▇▇▇▇
PH	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▁▁
CarbVolume	1	0.00	1.00	-3.42	-0.71	-0.26	0.86	2.90	▁▃▇▅▁
FillOunces	1	0.00	0.99	-3.88	-0.62	-0.01	0.60	3.91	▁▂▇▂▁
PCVolume	1	0.01	1.00	-3.72	-0.60	-0.07	0.59	3.01	▁▂▇▃▁
CarbPressure	1	0.01	1.00	-3.11	-0.72	0.04	0.70	2.88	▁▃▇▅▁
CarbTemp	1	0.01	1.00	-3.47	-0.65	0.01	0.69	2.90	▁▂▇▅▁
PSC	1	0.00	0.99	-2.75	-0.65	0.01	0.66	2.83	▁▅▇▅▁
PSCFill	1	0.00	1.00	-1.66	-0.81	-0.13	0.54	3.60	▆▇▃▁▁
PSCCO2	1	0.00	0.99	-1.31	-0.84	-0.37	0.33	4.31	▇▅▂▁▁
MnfFlow	1	0.00	1.00	-1.05	-1.05	0.50	0.97	1.58	▇▁▁▆▂
CarbPressure1	1	-0.01	1.00	-3.63	-0.77	0.13	0.60	3.77	▁▃▇▂▁
FillPressure	1	0.00	1.00	-5.38	-0.58	-0.45	0.70	3.23	▁▁▇▆▁
HydPressure1	1	0.00	1.00	-1.07	-1.01	-0.08	0.65	3.66	▇▅▂▁▁
HydPressure2	1	0.00	1.00	-1.29	-1.29	0.47	0.83	2.34	▇▂▇▅▁
HydPressure3	1	0.00	1.00	-1.37	-1.29	0.44	0.80	1.84	▇▁▃▇▁
HydPressure4	1	0.01	1.00	-3.40	-0.76	0.07	0.52	2.82	▁▅▇▇▁
FillerLevel	1	0.00	1.00	-2.85	-0.72	0.54	0.68	4.30	▁▅▇▁▁
FillerSpeed	1	-0.05	1.05	-3.44	0.20	0.41	0.44	0.51	▁▁▁▁▇
Temperature	1	0.00	1.00	-1.91	-0.57	-0.25	0.38	6.50	▇▆▁▁▁
Usagecont	1	0.00	1.00	-2.46	-0.92	0.20	0.95	1.83	▁▇▅▇▆
CarbFlow	1	0.00	1.00	-2.02	-1.33	0.52	0.69	2.99	▃▁▇▁▁
Density	1	0.00	1.00	-4.85	-0.68	-0.41	1.18	1.71	▁▁▁▇▅
MFR	1	0.00	1.00	-6.70	-0.02	0.29	0.41	3.35	▁▁▁▇▁
Balling	1	0.00	1.00	-6.22	-0.74	-0.50	1.19	1.68	▁▁▁▇▅
PressureVacuum	1	0.00	1.00	-2.42	-0.67	-0.31	0.39	2.84	▂▇▅▃▂
OxygenFiller	1	0.00	1.00	-1.98	-0.35	0.03	0.62	3.21	▃▇▆▂▁
BowlSetpoint	1	0.00	1.00	-2.39	-0.73	0.70	0.70	2.40	▁▃▃▇▁
PressureSetpoint	1	0.00	1.00	-1.94	-0.77	-0.77	1.16	1.96	▁▇▁▆▁
AirPressurer	1	0.00	1.00	-1.58	-0.53	-0.18	0.16	4.37	▅▇▁▁▁
AlchRel	1	0.00	1.00	-5.09	-0.72	-0.62	0.79	2.76	▁▁▇▂▃
CarbRel	1	0.00	1.00	-4.28	-0.75	-0.26	0.97	4.21	▁▂▇▅▁
BallingLvl	1	0.00	1.00	-2.36	-0.77	-0.66	1.26	1.86	▁▇▂▁▆

Data summary

Name	sdata_test_pp
Number of rows	512
Number of columns	34
_______________________
Column type frequency:
character	1
numeric	33
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
BrandCode	0	1	1	1	0	5	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	1237.47	705.94	5.00	638.00	1248.50	1803.75	2571.00	▇▇▇▇▆
PH	1	8.55	0.17	7.98	8.44	8.54	8.68	8.94	▁▃▇▇▃
CarbVolume	1	0.01	1.01	-3.12	-0.71	-0.19	0.80	2.52	▁▃▇▅▂
FillOunces	1	0.03	1.01	-3.20	-0.55	-0.01	0.60	3.99	▁▅▇▂▁
PCVolume	1	-0.01	0.93	-3.47	-0.55	-0.04	0.58	2.79	▁▂▇▅▁
CarbPressure	1	0.00	0.99	-3.54	-0.72	-0.02	0.71	2.74	▁▂▇▇▁
CarbTemp	1	-0.02	1.00	-3.47	-0.65	-0.09	0.64	2.90	▁▂▇▅▁
PSC	1	-0.05	1.06	-2.53	-0.81	-0.08	0.67	2.76	▂▇▇▅▂
PSCFill	1	-0.02	0.99	-1.66	-0.81	-0.13	0.54	3.43	▆▇▃▂▁
PSCCO2	1	0.07	1.03	-1.31	-0.84	-0.37	0.56	4.31	▇▅▂▁▁
MnfFlow	1	-0.06	0.98	-1.05	-1.05	-0.22	0.92	1.69	▇▁▁▇▁
CarbPressure1	1	-0.02	1.06	-3.46	-0.90	0.09	0.65	3.34	▁▅▇▅▁
FillPressure	1	0.02	1.04	-3.82	-0.58	-0.45	0.70	3.41	▁▁▇▅▁
HydPressure1	1	-0.03	1.01	-1.07	-1.01	-0.12	0.57	3.21	▇▅▂▁▁
HydPressure2	1	-0.04	1.00	-1.29	-1.29	0.42	0.81	2.08	▇▁▆▆▁
HydPressure3	1	-0.05	1.00	-1.37	-1.29	0.40	0.79	1.81	▇▁▃▇▁
HydPressure4	1	0.07	0.99	-2.16	-0.58	0.23	0.52	2.82	▂▃▇▂▁
FillerLevel	1	-0.07	1.05	-2.48	-1.22	0.57	0.69	3.08	▂▃▇▁▁
FillerSpeed	1	-0.16	1.18	-3.43	0.05	0.35	0.44	0.48	▁▁▁▁▇
Temperature	1	0.06	1.05	-1.91	-0.57	-0.10	0.38	6.30	▆▇▁▁▁
Usagecont	1	-0.05	1.01	-2.54	-1.04	0.19	0.94	1.78	▁▇▅▇▇
CarbFlow	1	0.08	1.00	-2.02	-1.25	0.54	0.74	1.43	▃▁▁▇▅
Density	1	0.02	1.00	-4.14	-0.68	-0.41	1.18	1.68	▁▁▆▇▇
MFR	1	-0.01	0.97	-6.40	-0.04	0.29	0.39	2.91	▁▁▁▇▁
Balling	1	-0.01	1.01	-3.70	-0.82	-0.50	1.19	1.63	▁▁▇▂▆
PressureVacuum	1	0.04	1.00	-2.42	-0.67	0.04	0.39	2.84	▂▇▆▃▂
OxygenFiller	1	0.09	0.98	-1.98	-0.27	0.08	0.69	3.21	▃▇▇▂▁
BowlSetpoint	1	-0.09	1.08	-2.39	-1.35	0.70	0.70	2.40	▂▃▂▇▁
PressureSetpoint	1	0.02	0.98	-1.94	-0.77	-0.77	1.16	1.28	▁▇▁▁▆
AirPressurer	1	0.04	1.05	-1.76	-0.53	-0.18	0.16	3.91	▂▇▁▁▁
AlchRel	1	0.00	0.99	-1.08	-0.72	-0.67	0.68	2.74	▇▁▂▃▁
CarbRel	1	-0.01	1.00	-3.67	-0.75	-0.26	0.83	3.02	▁▂▇▅▁
BallingLvl	1	0.01	1.01	-2.36	-0.77	-0.68	1.26	1.65	▁▆▇▁▇

Data summary

Name	edata_test_pp
Number of rows	267
Number of columns	34
_______________________
Column type frequency:
character	1
logical	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
BrandCode	0	1	1	1	0	5	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
PH	267	0	NaN	:

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	134.00	77.22	1.00	67.50	134.00	200.50	267.00	▇▇▇▇▇
CarbVolume	1	-0.02	1.04	-2.23	-0.81	-0.26	0.89	2.63	▂▇▅▅▁
FillOunces	1	-0.04	0.86	-2.60	-0.62	-0.09	0.45	2.60	▁▅▇▃▁
PCVolume	1	0.01	1.04	-3.28	-0.70	0.01	0.75	2.82	▁▃▇▇▁
CarbPressure	1	0.01	1.08	-2.42	-0.80	-0.02	0.70	2.46	▂▆▇▅▂
CarbTemp	1	0.03	1.06	-3.04	-0.65	-0.04	0.71	2.90	▁▃▇▃▁
PSC	1	-0.01	1.08	-2.53	-0.75	0.01	0.66	2.55	▂▅▇▃▂
PSCFill	1	-0.05	0.96	-1.49	-0.73	-0.13	0.46	3.60	▆▇▃▁▁
PSCCO2	1	-0.12	0.89	-1.31	-0.84	-0.37	0.09	4.31	▇▃▂▁▁
MnfFlow	1	-0.04	0.98	-1.05	-1.05	-0.22	0.96	1.62	▇▁▁▆▂
CarbPressure1	1	0.09	0.94	-2.05	-0.53	0.17	0.60	2.87	▂▃▇▂▁
FillPressure	1	0.06	1.10	-3.82	-0.58	0.01	0.76	3.37	▁▁▇▆▁
HydPressure1	1	-0.04	1.09	-5.04	-1.01	-0.17	0.63	3.02	▁▁▇▆▂
HydPressure2	1	-0.07	1.05	-4.34	-1.29	0.35	0.83	2.46	▁▁▆▇▁
HydPressure3	1	-0.07	1.04	-4.42	-1.29	0.44	0.77	1.79	▁▁▆▃▇
HydPressure4	1	0.13	1.05	-2.63	-0.41	0.23	0.66	2.73	▁▂▇▃▂
FillerLevel	1	0.06	1.01	-2.32	-0.66	0.56	0.69	3.51	▃▃▇▁▁
FillerSpeed	1	-0.23	1.25	-3.43	0.04	0.28	0.44	0.48	▁▁▁▁▇
Temperature	1	0.21	1.20	-1.74	-0.40	-0.09	0.53	6.09	▇▇▂▁▁
Usagecont	1	-0.05	1.01	-2.37	-1.02	0.06	0.94	1.29	▁▅▃▂▇
CarbFlow	1	-0.03	1.07	-2.02	-1.36	0.53	0.73	1.46	▅▁▁▇▃
Density	1	-0.02	1.14	-9.22	-0.64	-0.41	1.14	1.58	▁▁▁▃▇
MFR	1	-0.05	1.16	-6.71	-0.01	0.29	0.41	1.50	▁▁▁▁▇
Balling	1	0.00	1.00	-1.98	-0.74	-0.50	1.16	1.54	▁▇▃▁▇
PressureVacuum	1	0.08	1.02	-2.07	-0.67	0.04	0.74	2.84	▁▇▆▃▁
OxygenFiller	1	0.02	0.99	-1.98	-0.41	0.05	0.55	3.20	▅▇▇▂▁
BowlSetpoint	1	0.00	0.99	-2.39	-0.73	0.70	0.70	1.52	▂▂▃▇▁
PressureSetpoint	1	0.06	1.00	-1.94	-0.77	-0.77	1.16	1.96	▁▇▁▆▁
AirPressurer	1	0.01	1.03	-1.40	-0.53	-0.18	-0.01	3.60	▅▇▁▁▁
AlchRel	1	0.01	0.98	-1.08	-0.72	-0.62	0.68	1.74	▇▁▁▁▃
CarbRel	1	0.02	0.99	-2.14	-0.75	-0.26	0.97	2.24	▁▇▃▃▃
BallingLvl	1	0.00	1.02	-2.36	-0.77	-0.66	1.19	1.58	▁▃▇▁▇

Data visualization

First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of incomplete cells in the dataset, we prefer this dropping observations with any incomplete columns.

The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot package. This was the most efficient way to gain insight on the related clusters of variables.

M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

We compare the correlations to the cleaned pre-processed training data.

Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )

Exporting the Post-Processed Data Sets

Our last step of preprocessing is to export the data sets. This need only but done once.

readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )

The above data files have the following properties and relationships to the original raw files:

Pre-Processed Data Files

filename	primary_key	response	numeric_columns	num_rows	file_source
sdata_train_pp.csv	id - consistent with sdata_test_pp	PH - unchanged	BoxCox, center, scaled, imputed	2055	StudentData.xlsx
sdata_test_pp.csv	id - consistent with sdata_train_pp	PH - unchanged	BoxCox, center, scaled, imputed	512	StudentData.xlsx
edata_test_pp.csv	id - standalone	PH - unchanged	BoxCox, center, scaled, imputed	267	StudentEvaluation.xlsx

An important note about the id column. The id in the file edata_test_pp.csv is the row number of the same observation in StudentEvaluation.xlsx. But the same is not true for sdata_train_pp.csv and sdata_test_pp.csv because they are sampled from the source file StudentData.xlsx. The 10th row of sdata_test_pp.csv is not necessarily the 10th row of StudentData.xlsx rather one should use the id field as the corresponding row number of the original file.

Importing the Pre-Processed Data

We load the pre-processed datasets for training and testing in this section. We load the csv file for the training data set below and display the first few column characteristics from the file.

sdata_train_pp = read_csv("sdata_train_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_test_pp  = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_trainX_pp  = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp   = sdata_test_pp %>% select(-PH , -id)

tuningFull = TRUE   # Set to TRUE if running the lengthy tuning process for all models

Model Building

Gradient Boosted Trees

We consider the same dataset and variable importance problem using gradient boosted trees. The gbm library and variable importance measures are run within caret training procedure. The gradient boosting approach relies on a weak learner in an additive manner that minimizes the given loss function.

set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )

## Stochastic Gradient Boosting 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   0.02       3                   250     0.1434366  0.3445322  0.1130661
##   0.02       3                   500     0.1399034  0.3781615  0.1093662
##   0.02       3                   750     0.1381810  0.3948955  0.1075668
##   0.02       3                  1000     0.1371433  0.4053153  0.1065292
##   0.02       7                   250     0.1375662  0.3998266  0.1072462
##   0.02       7                   500     0.1339808  0.4329093  0.1037283
##   0.02       7                   750     0.1326817  0.4453548  0.1025127
##   0.02       7                  1000     0.1320709  0.4515327  0.1019375
##   0.05       3                   250     0.1391827  0.3846115  0.1085754
##   0.05       3                   500     0.1370326  0.4073776  0.1063487
##   0.05       3                   750     0.1364019  0.4163811  0.1056864
##   0.05       3                  1000     0.1360931  0.4213869  0.1053766
##   0.05       7                   250     0.1337931  0.4330732  0.1032574
##   0.05       7                   500     0.1324935  0.4472241  0.1018857
##   0.05       7                   750     0.1319768  0.4531566  0.1013501
##   0.05       7                  1000     0.1317495  0.4562800  0.1010483
##   0.10       3                   250     0.1379832  0.3995082  0.1072565
##   0.10       3                   500     0.1371777  0.4135910  0.1063966
##   0.10       3                   750     0.1368283  0.4201240  0.1059677
##   0.10       3                  1000     0.1366484  0.4244918  0.1056588
##   0.10       7                   250     0.1343586  0.4341016  0.1034324
##   0.10       7                   500     0.1335822  0.4437819  0.1027288
##   0.10       7                   750     0.1332925  0.4472134  0.1023291
##   0.10       7                  1000     0.1332832  0.4482130  0.1022717
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  7, shrinkage = 0.05 and n.minobsinnode = 15.

As the model output indicates the greater tree depth outperforms the lower counts and the shrinkage value of 0.05 produces the best RMSE value. The shrinkage, or learning rate, typically works better with smaller values, yet require longer processing time.

## Saving 7 x 5 in image

The plot of the tuned model clearly shows the better performance with the increased max tree depth but also indicates the lower shrinkage value doesn’t guarantee a lower RMSE. For the max tree depth of 3 and 7, the lowest shrinkage value does not produce the lowest RMSE for that tree depth.

## Saving 7 x 5 in image

The variable importance plot shows the 100% importance of the MnfFlow followed by BrandCode as the only other variable with an importance above 50%. Usagecont, OxygenFiller, AlchRel, and Temperature round out the top 6 variables based on importance with scores greater than 25%.

Overall, the final Gradient Boosted model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.

Cubist

Now we train a Cubist tree using the best selectionFunction on RMSE. The Cubist library and variable importance measures are run within caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contains linear regression models. The models are based on predictions defined in the previous nodes of the tree.

set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

## Cubist 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.10326126  0.6469501  0.07521436
##    10         1          0.10698108  0.6394174  0.07408043
##    10         5          0.09661241  0.6865169  0.06923965
##    10         9          0.09715807  0.6838488  0.06954861
##    50         0          0.10213076  0.6574062  0.07362497
##    50         1          0.10583349  0.6449193  0.07208654
##    50         5          0.09539318  0.6943587  0.06757449
##    50         9          0.09598916  0.6917813  0.06811918
##   100         0          0.10218813  0.6576706  0.07366847
##   100         1          0.10587327  0.6449503  0.07201655
##   100         5          0.09547621  0.6941947  0.06752479
##   100         9          0.09606335  0.6917983  0.06810842
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.

The Cubist model results provide a strong RMSE value and \(R^2\) value with generally a higher number of committees and neighbors.

The line plot confirms the better model per RMSE with neighbors at 5 and committees set to 50.

## Saving 7 x 5 in image

The variable importance plot identifies MnfFlow as the most valuable variable followed by Balling, BallingLvl, AlchRel, and PressureVacuum to round out the top 5 variables. Overall, 7 variables resulted in over 50% importance indicating those variables are contained in over 50% of the rules defining the Cubist model.

## Saving 7 x 5 in image

The final Cubist model confirms the high usage rate of the variable MnfFlow.

Cubist Predictor Usage

Conditions	Model	Variable
82	61	MnfFlow
18	81	Balling
22	71	BallingLvl
19	73	AlchRel
28	60	PressureVacuum
19	63	OxygenFiller
14	66	Density
21	51	BowlSetpoint
8	61	HydPressure3
7	62	Temperature
9	59	AirPressurer
18	47	CarbFlow
6	58	CarbRel
3	54	HydPressure2
3	52	CarbPressure1
11	40	Usagecont
16	31	FillerSpeed
6	39	HydPressure1
41	0	BrandCode
3	33	FillerLevel
0	27	CarbPressure
3	20	CarbVolume
2	21	PCVolume
0	20	CarbTemp
0	20	PressureSetpoint
0	17	FillPressure
0	16	HydPressure4
0	16	PSCFill
0	6	MFR
0	5	FillOunces
0	4	PSC
0	3	PSCCO2

The splits plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow along with the use of greater than or less than inequalities for MnfFlow.

## quartz_off_screen 
##                 2

The dotplot allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule.
When a predictor has more dots, it is used in more models which suggests the variable is influential. The color density is proportional to the variable importance. The sign of the coefficient predicts the marginal effect of the predictor on the response PH. However, since Cubist uses many models, the concept of the sign of a predictor coefficient can be unclear.
The x-axis value of the dot in the dotplot indicates the true coefficient in the original units unlike the above plot which normalizes the x-axis.

## quartz_off_screen 
##                 2

The boxplot below which we devised constructs a distribution of coefficients for each predictor. We interpret the sign of the predictor \(X\) marginal effect to be the sign of the median of the distribution of coefficients pf \(X\) over those models which include \(X\). The predictors have been sorted by their median.

The plot provides clarity regarding the impact of a given variable as the median is a good estimate of the sign of the variable coefficient. The AlchRel variable box indicates a positive relationship with the dependent variable PH. The most important variable MnfFlow box results in a negative coefficient, and thus an inverse effect on PH. For all the rules in the final Cubist model, the understanding of the coefficient values for each variable provides a clear relationship to the dependent variable despite the obtuse nature of the Cubist model construction.

## Saving 7 x 5 in image

Overall, the final Cubist model results in an RMSE under 0.1 and a \(R^2\) just below of 0.7.

CART

Using the rpart library and its training methods, we consider the best CART based on the RMSE metric. The CART approach follows a classification and regression tree methodology.

set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )

## CART 
## 
## 2055 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (30 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##    1        0.1540934  0.2150028  0.1207013
##    2        0.1485972  0.2697600  0.1163125
##    3        0.1439126  0.3150026  0.1130389
##    4        0.1418976  0.3346403  0.1106321
##    5        0.1408007  0.3455111  0.1095038
##    6        0.1398046  0.3557469  0.1085050
##    7        0.1389994  0.3635874  0.1074293
##    8        0.1381955  0.3717156  0.1065335
##    9        0.1373228  0.3802125  0.1057161
##   10        0.1362384  0.3899605  0.1046780
##   11        0.1353910  0.3979319  0.1038336
##   12        0.1345705  0.4055253  0.1028965
##   13        0.1340562  0.4106769  0.1022712
##   14        0.1335638  0.4154690  0.1017284
##   15        0.1330891  0.4202569  0.1012114
##   16        0.1327996  0.4230057  0.1008957
##   17        0.1327006  0.4239743  0.1008136
##   18        0.1326532  0.4245372  0.1007342
##   19        0.1326119  0.4250877  0.1006977
##   20        0.1326119  0.4250877  0.1006977
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.

The model output shows increasing tree depth leads to improved RMSE and \(R^2\) values.

The below plot displays the improving RMSE value with the increasing tree depth with an optimal tree depth of 19.

## Saving 7 x 5 in image

Once again, the variable MnfFlow is the most important variable with an importance rating of 100%. Six variables achieve an importance score of greater than 50% with BrandCode, Temperature, and CarbRel each achieving greater than 75% importance.

## Saving 7 x 5 in image

The CART tree plot of the final model reaffirms the importance of the variable MnfFlow as the initial node splits on MnfFlow. One of the second level nodes is BrandCode, another highly important variable to the CART model. Eight ff the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different model approaches.

Overall, the final CART model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.

MARS

Now we train a MARS model based on the implementation of earth from the caret library. The MARS approach is a non-linear regression model that does not rely on the original variables but instead on derived features. The tuned MARS model defines the derived features as linear combinations of the initial variables.

## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922

## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
##             nprune=27)
## 
##                                                         coefficients
## (Intercept)                                                8.3958776
## h(-0.215889-MnfFlow)                                       0.1780017
## h(HydPressure3- -1.29159)                                  0.0391037
## h(-0.728732-Temperature)                                  -0.0805473
## h(Temperature- -0.728732)                                 -0.0477016
## h(-0.658787-Balling)                                       0.3588441
## h(-0.832915-OxygenFiller)                                 -0.0445311
## h(OxygenFiller- -0.832915)                                -0.0417267
## h(BowlSetpoint- -1.34947)                                  0.0724302
## h(0.640105-AlchRel)                                        0.0395901
## h(AlchRel-0.640105)                                        0.1293063
## h(1.05286-MnfFlow) * BrandCodeC                           -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC                           -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629)         -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum)         -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882)           -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer)            0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel)               -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947)      -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling)             -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787)             0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787)             0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947)         0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787)            -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947)           -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel)           0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718)    0.0305074
## 
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598    RSS 27.25932    GRSq 0.5260713    RSq 0.5555922

The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.

## Saving 7 x 5 in image

Once again, the variable MnfFlow achieves an importance score of 100%. Only 3 variables reach the score of 50%, the aforementioned MnfFlow, BrandCodeC, and AlchRel.

The variable MnfFlow is influential because the variable effects many other variables in the final model. Due to the numerous linear combinations including MnfFlow, the relationship to PH is difficult to ascertain. The MnfFlow variable requires more investigation into the variable interactions to determine the impact to PH for the MARS model.

## Saving 7 x 5 in image

Overall, the MARS model reaffirms the importance of the variable MnfFlow and also results in an RMSE and \(R^2\) in line with the previous models for Gradient Boosted Trees and CART.

Linear Regression

We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51932 -0.07897  0.00917  0.08626  0.44987 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.586130   0.010428 823.364  < 2e-16 ***
## CarbVolume       -0.005488   0.008389  -0.654 0.513047    
## FillOunces       -0.004105   0.003163  -1.298 0.194451    
## PCVolume         -0.006448   0.003678  -1.753 0.079740 .  
## CarbPressure     -0.003088   0.011767  -0.262 0.793033    
## CarbTemp          0.008475   0.010712   0.791 0.428936    
## PSC              -0.003168   0.003147  -1.007 0.314160    
## PSCFill          -0.005762   0.003066  -1.879 0.060361 .  
## PSCCO2           -0.006170   0.003020  -2.043 0.041221 *  
## MnfFlow          -0.089502   0.006520 -13.728  < 2e-16 ***
## CarbPressure1     0.030465   0.003680   8.278 2.25e-16 ***
## FillPressure      0.005601   0.004501   1.244 0.213528    
## HydPressure1      0.004482   0.005177   0.866 0.386658    
## HydPressure2     -0.027126   0.009963  -2.723 0.006533 ** 
## HydPressure3      0.059090   0.010506   5.624 2.12e-08 ***
## HydPressure4      0.001523   0.004624   0.329 0.741982    
## FillerLevel      -0.021768   0.008733  -2.493 0.012760 *  
## FillerSpeed       0.003863   0.004923   0.785 0.432809    
## Temperature      -0.019234   0.003564  -5.397 7.59e-08 ***
## Usagecont        -0.020299   0.003911  -5.191 2.31e-07 ***
## CarbFlow          0.012499   0.004515   2.768 0.005685 ** 
## Density          -0.018774   0.009023  -2.081 0.037598 *  
## MFR               0.002030   0.003898   0.521 0.602651    
## Balling          -0.091337   0.014856  -6.148 9.43e-10 ***
## PressureVacuum   -0.016446   0.004588  -3.585 0.000345 ***
## OxygenFiller     -0.015182   0.004217  -3.600 0.000326 ***
## BowlSetpoint      0.053190   0.009135   5.823 6.72e-09 ***
## PressureSetpoint -0.013892   0.004500  -3.087 0.002047 ** 
## AirPressurer     -0.002649   0.003204  -0.827 0.408420    
## AlchRel           0.034578   0.011814   2.927 0.003463 ** 
## CarbRel           0.005015   0.006593   0.761 0.447003    
## BallingLvl        0.097856   0.017857   5.480 4.79e-08 ***
## BrandCodeA       -0.095705   0.025430  -3.764 0.000172 ***
## BrandCodeC       -0.139671   0.010233 -13.649  < 2e-16 ***
## BrandCodeD       -0.043954   0.029625  -1.484 0.138054    
## BrandCodeE       -0.057782   0.015435  -3.744 0.000186 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.4123 
## F-statistic: 42.16 on 35 and 2019 DF,  p-value: < 2.2e-16

The most significant variables are MnfFlow, CarbPressure, HydPressure3, Temperature, Usagecont, Balling, PressureVacuum, OxygenFiller, BowlSetpoint, BallingLvl, and BrandCode. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median value of the variable coefficients from the top 10 of the Cubist model match the sign of the coefficient value in the linear regression model. This finding further supports the assessment of the Cubist model variable impacts on the dependent variable PH.

We summarize the plot of coefficients below.

## NULL

The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.

Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.

Model Selection

Based on the above results, we summarize their training and test RMSE and \(R^2\) below.

The joint results are tabulated below.

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

Model Results Comparison

modelName	trainRMSE	trainR2	testRMSE	testR2
cubist	0.095	0.694	0.090	0.722
rpart2	0.133	0.425	0.125	0.474
lm	0.134	0.396	0.127	0.449
gbm	0.132	0.456	0.151	0.342
earth	0.125	0.486	0.153	0.279

## save_kable will have the best result with magick installed.

We also inspect the plot of the predicted versus the observed data for the 5 models. The scatterplots below confirm the Cubist model outperforms the others on both the training set in the upper row and also the test data on the lower rows. The distribution of the Cubist plot mostly closely aligns with the 45-degree line which corresponds to an accurate prediction.

Discussion

We conclude that the best model is Cubist but the multilinear regression model, MARS and CART models provide useful insight on the main predictors.

We would recommend the use of the Cubist model for prediction
However, the OLS (ordinary least squares) linear regression help us to understand the sign and magnitude of the most important variables.
The MARS model helps to detect non-linearity and predictor interaction in prediction.

The top 5 variables that impact PH are:

MnfFlow - this appears negatively related to PH level based on OLS, MARS and Cubist median plot.
BallingLvl is positively related to PH level.
PreserveVacuum is negatively related to PH.
AlchRel is positively related to PH.
Balling is negatively related to PH.

Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Mostly likely Brand refers to a recipe for the final product. However, knowing the desired optimal PH for each Brand may help us control the most influential variables that control PH, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend looking at the top 10 most influential predictors. Most of the models agree substantially on which predictors to include in the top 10.

For example, the linear regression model coefficients agree with the sign and variable importance of the top 10 predictors of the Cubist model. Importance predictors of the OLS model have the highest statistical significance - consistent with variable importance measure of caret.

Final Predictions

## Cubist 
## 
## 2567 samples
##   32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE        Rsquared   MAE       
##    10         0          0.09786899  0.6814835  0.07063050
##    10         1          0.10347326  0.6605109  0.07157433
##    10         5          0.09205621  0.7148932  0.06565672
##    10         9          0.09275054  0.7104184  0.06629558
##    50         0          0.09733927  0.6877932  0.07034005
##    50         1          0.10284267  0.6632483  0.07070141
##    50         5          0.09105650  0.7209833  0.06455347
##    50         9          0.09179098  0.7169196  0.06532125
##   100         0          0.09682465  0.6925441  0.07016957
##   100         1          0.10246532  0.6656340  0.07043498
##   100         5          0.09042487  0.7248834  0.06419195
##   100         9          0.09121584  0.7206745  0.06503557
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.

RMSE of the full StudentData dataset: 0.0904249.

\(R^2\) of the full StudentData dataset: 0.7248834.

Code

We summarize all the R code used in this project in this appendix for ease of reading.

knitr::opts_chunk$set(echo = FALSE)
# Conditional Code Evaluation is managed here.

suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rpart.plot))
suppressPackageStartupMessages(library(partykit))
suppressPackageStartupMessages(library(gbm))
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(Cubist)
library(rpart)
library(party)
library(writexl)
.code-bg {
  background-color: lightgray;
}

raw_StudentData <- read_excel("StudentData.xlsx")
dim(raw_StudentData)
skim(raw_StudentData) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentData) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

raw_StudentData %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` ) %>%
  filter( !is.na(PH)) %>% as_tibble() ->  sdata_v1

dim(sdata_v1)
# include any rows where any column is NA.  Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id  ) %>% relocate(id)-> missing_train_rows
  
dim(missing_train_rows)
missing_train_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary

missing_train_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells = sum( missing_train_summary$num_na)
num_cells = nrow(sdata_v1) * 32  # excludes the response variable PH since we dropped its values.

print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>% 
  yank("numeric") %>% 
  arrange( complete_rate) %>% 
  kable( digits = 3 ) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(raw_StudentEvaluation) %>% yank("character") %>%
  kable(  caption = "Character Variables in Beverage Data" , digits = 3) %>% 
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentEvaluation %>% 
  rename_with(~ str_replace(., " ", ""),  everything() ) %>%
  mutate( id = row_number() )  %>% 
  relocate(id) %>%
  relocate(PH, .after = id) %>%
  relocate(`BrandCode`, .after = `BallingLvl` )  %>% as_tibble() ->  edata_v1

dim(edata_v1)
# include any rows where any column is NA.  Always include id column
#
#  NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
#  for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1  

# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2

# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id  ) %>% relocate(id)-> missing_test_rows
  
dim(missing_test_rows)
missing_test_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary

missing_test_summary %>% group_by(num_na) %>%
  summarize( count = n()) %>%
  kable( caption = "Number of Rows with NA") %>%
  kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells_test = sum( missing_test_summary$num_na)
num_cells_test = nrow(edata_v1) * 32  # excludes the response variable PH since we dropped its values.

print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " , 
             sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )


skim_sdata = skim(sdata_v1)
skim_edata = skim(edata_v1)

skim_sdata %>% select(  skim_variable, n_missing, complete_rate) %>% inner_join(skim_edata %>% select(skim_variable, n_missing, complete_rate), by = c("skim_variable") ) %>% filter(skim_variable != 'PH' ) -> skim_complete_rates 

skim_complete_rates %>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>%  ggplot(aes(x=training_complete, y = testing_complete)) + 
  geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
  labs(title = "Completion Rate by Predictor between Training and Test Sets")

# Feature plot for the numeric predictor variables against the result variable PH
cols <- sdata_v1 %>%
  select(-c('id', 'BrandCode', 'PH')) %>% colnames()

featurePlot(sdata_v1[,cols], 
            sdata_v1$PH, 
            plot="scatter",
            type = c("p", "smooth"),
            span = .5,
            layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
data_long <- sdata_v1[,cols] %>%
  pivot_longer(cols) %>%
  as.data.frame()

ggp2 <- ggplot(data_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ name, scales="free") +
  labs(title = "Density Plot of Predictor Variables")
ggp2
# http://www.sthda.com/english/wiki/normality-test-in-r
# From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
shap_test_res <- lapply(sdata_v1[,cols], shapiro.test)
# Result: None of the numeric variables are normally distributed.

# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
subset_vector  <- sapply(shap_test_res, function(x) x$p.value > .05)
results_subset <- shap_test_res[subset_vector]

length(results_subset)
ggplot(data = sdata_v1) +
  geom_bar(mapping = aes(x = `BrandCode`)) +
  labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
  geom_boxplot() +
  labs(title = "Boxplot of Brand Code")
sdata_v1_no_nas <- sdata_v1 %>% drop_na()

bev.vsurf <- VSURF(sdata_v1_no_nas[,cols], 
                   sdata_v1_no_nas$PH,
                   ntree = 10,
                   nfor.thres = 20,
                   nfor.interp = 10, nfor.pred = 10)
bev.vsurf
summary(bev.vsurf)
bev.vsurf$varselect.pred
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )

sdata_test    = sdata_v1[-train_indices , ]
sdata_train   = sdata_v1[ train_indices , ]
sdataY_test   = sdata_v1$PH[-train_indices ]
sdataY_train  = sdata_v1$PH[ train_indices ]

nearZeroVar(sdata_train, saveMetrics = TRUE) %>% 
  kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

skim(sdata_train) %>% 
 mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
          zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
          zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
          )  %>%
  filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
  arrange( desc(zscore_extreme)) %>%
  dplyr::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
  kable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
  kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
pl1 = sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)

plotlist= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plot_grid(plotlist=plotlist, nrow = 2 )


median_MFR = median( sdata_train$MFR  , na.rm = TRUE ) 

sdata_train %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_train

sdata_test  %>% 
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode))  -> sdata_v2_test

edata_v1 %>%
  mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
  mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) ->  edata_v2



# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected.  They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale",  "knnImpute") )

# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc,  as.data.frame(sdata_v2_train ) )

# Becomes the final version of test data for validation
sdata_test_pp  = predict( preProcFunc,  as.data.frame(sdata_v2_test) )

# Need to generate predictions based on test data without known response
edata_test_pp  = predict( preProcFunc,  as.data.frame(edata_v2) )


sdata_trainX_pp  = sdata_train_pp %>% select( -PH)
sdata_testX_pp   = sdata_test_pp %>% select(-PH)
edata_testX_pp   = edata_test_pp %>% select( -PH)

skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")

corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 ,  tl.cex = 0.7 )


readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )

readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )

readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )


file_df = tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
                  primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
                  response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
                  
                  numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
                  num_rows = c(2055, 512, 267 ) ,
                  file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)

file_df %>% kable(caption = "Pre-Processed Data Files") %>% kable_styling(bootstrap_options = c("hover", "striped"
                                                                                            ), 
                                                                          position = "left")


sdata_train_pp = read_csv("sdata_train_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_test_pp  = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

sdata_trainX_pp  = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp   = sdata_test_pp %>% select(-PH , -id)
tuningFull = TRUE   # Set to TRUE if running the lengthy tuning process for all models
set.seed(1027)

gbmControlFull <- trainControl(method = "repeatedcv", 
                           number = 10,  # for debug mode, set this to a low number (1)
                           repeats = 10 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

gbmControlLight <- trainControl(method = "repeatedcv", 
                           number = 5,  # for debug mode, set this to a low number (1)
                           repeats = 1 ,
                           selectionFunction = "best", 
                           verboseIter = FALSE)

if(tuningFull == TRUE)
{
   gbmControl = gbmControlFull
   gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
                                                  shrinkage = c( 0.02, 0.05,  0.1) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 15
                                                  )
} else  {
   gbmControl = gbmControlLight
   gbmTune =  expand.grid( n.trees = c( 2000), 
                                                  shrinkage = c( 0.02, 0.05 , 0.1 ) ,
                                                  interaction.depth = c(  3, 7) ,
                                                  n.minobsinnode = 4
                                                  )
}

(gbmTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "gbm",
                          tuneGrid =  gbmTune ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = gbmControl ) )
gbm_plot <- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
ggsave("GradientBoostedTreesTurning.jpg", gbm_plot)
gbm_vi_plot <- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
ggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)

cubistControlFull <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridFull  <- expand.grid( committees = c( 10, 50, 100 ) ,
                              neighbors = c( 0, 1, 5, 9 )
                                                  ) 

cubistControlLight <- trainControl(method = "cv" ,  selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) , 
                              neighbors = c( 0, 5 )  )

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainX_pp, 
                          y = sdata_train_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )
cubist_plot <- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
ggsave("CubistTuning.jpg", cubist_plot)
cubist_vi_plot <- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
ggsave("CubistVarImp.jpg", cubist_vi_plot)
cubistTune$finalModel$usage %>% arrange( desc(Model+ Conditions)) %>% 
  kable(caption = "Cubist Predictor Usage") %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_splitsplot <- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot
# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplot
dev.off()
cubist_dotplot <- dotplot( cubistTune$finalModel, 
         what = "coefs", 
         between = list(x = 0.2, y = 0.5) , 
         scales = list(x = list(relation = "free"),  
                       y = list(cex = 0.25)  ) )
cubist_dotplot
jpeg(file="CubistDotPlot.jpg")
cubist_splitsplot
dev.off()
cubistTune$finalModel$coefficients %>% 
  rownames_to_column(var="id") %>% 
  select(-committee, -rule) %>% 
  pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>% 
  filter( !is.na(coef)) %>% 
  filter( predictor != '(Intercept)' )  %>%
  filter(abs(coef) < 20 ) -> coef_piv

cubist_boxplot <- coef_piv %>% 
  ggplot() + 
  geom_boxplot(
    aes(x=reorder(predictor, coef, FUN = median, na.rm = TRUE ),  # order the boxplots by median
        y = coef ), outlier.shape = NA ) +  # strip out the outliers
  coord_flip(ylim=c(-.5,.5)) +
  labs(title = "Distribution of Coefficients of Predictors", 
       subtitle = "from Cubist Committee/Rules",
       x = "Predictors" ,
       y = "Coefficient in Linear Model"
       )

cubist_boxplot
ggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)

library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 )  )

cartControl <- trainControl(method = "boot", 
                            number = 30,  
                            selectionFunction = "best")

(rpartTune = train( x = sdata_trainX_pp, 
                    y = sdata_train_pp$PH ,
                   method = "rpart2" ,
                   metric = "RMSE" ,
                   tuneGrid = cartGrid ,
                   trControl = cartControl ) )
cart_tuning_plot <- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
ggsave("CartTuningPlot.jpg", cart_tuning_plot)
cart_vi_plot <- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
ggsave("CartVarImp.jpg", cart_vi_plot)
rpart.plot(rpartTune$finalModel, digits = 3)

if( tuningFull == TRUE )
{
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                
                     selectionFunction = "best", 
                     verboseIter = FALSE)
} else {
    marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
    
    marsControl = trainControl(method = "cv", 
                     number = 10, 
                     
                     selectionFunction = "best", 
                     verboseIter = FALSE)
}
    

set.seed(1000)

marsTuned = caret::train(x = sdata_trainX_pp , 
                         y = sdata_train_pp$PH, 
                         method = "earth" ,
                         metric = "RMSE" , 
                         tuneGrid = marsGrid ,
                         trControl = marsControl )
marsTuned$finalModel

summary(marsTuned)
mars_tuning_plot <- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
ggsave("MarsTuningPlot.jpg", mars_tuning_plot)
mars_vi_plot <- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
ggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)

train.control <- trainControl( method = "cv", number = 5)

linearTune = train( PH ~ . -id, 
                    data = sdata_train_pp, 
                    method = "lm", 
                    tuneGrid = data.frame(intercept = TRUE),
                    trControl = train.control)

summary(linearTune$finalModel)
par(mfrow=c(2,2))

lm_final_model_plot <- plot(linearTune$finalModel)
lm_final_model_plot
#ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)

# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()

model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)

model_stats = data.frame()

for( modelObj in model_list )
{
    if( modelObj$method != 'lm' ){
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
    }
    else
    {
        pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
    }
    output <- data.frame( modelName = modelObj$method ,
                     trainRMSE = getTrainPerf(modelObj)[1,1] ,
                     trainR2   = getTrainPerf(modelObj)[1,2] ,
                     testRMSE  = caret::RMSE( pred_data,  sdata_test_pp$PH) ,
                     testR2    = caret::R2(   pred_data,  sdata_test_pp$PH) 
                     )

    model_stats <- rbind(model_stats, output)
}

results_table <- model_stats %>% as_tibble() %>% 
  arrange( testRMSE) %>%
  kable( digits = 3, caption = "Model Results Comparison") %>%
  kable_styling(bootstrap_options = c("hover", "striped"), position = "left")

results_table

save_kable(results_table, "ModelResultsTable.jpg")
predVals = extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )

# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
pred_lm =  predict(linearTune, newdata = sdata_test_pp)
pred_lm_train <- linearTune$finalModel$fitted.values

# Create placeholder data for the training values
model_l <- rep("LM", length = 2055)
dt_l <- rep("Training", length = 2055)
obj_l <- rep("Object9", length = 2055)

# Create dataframe of the training data of LM
lm_train_preds <- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Create placeholder data for the test values
model_l <- rep("LM", length = 512)
dt_l <- rep("Test", length = 512)
obj_l <- rep("Object10", length = 512)

# Create dataframe of the test data of LM
lm_test_preds <- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)

# Combine all data for plotting
all_preds <- rbind(predVals, lm_train_preds, lm_test_preds)

# Ensure obs and preds are numerics
all_preds$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)

# Plot using Caret built-in reporting.
# --------------------------------------
obsVsPredsPlot <- plotObsVsPred(all_preds)
obsVsPredsPlot
#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
sdata_trainfull_pp <- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfullX_pp <- sdata_trainfull_pp %>% select(-PH , -id)

edata_test_pp = read_csv("edata_test_pp.csv", 
                          col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% 
  as_tibble() %>% remove_rownames()

# Train the Cubist model
set.seed(1027)

if(tuningFull == TRUE)
{
   cubistControl = cubistControlFull
   cubistGrid = tuneGridFull
} else  {
   cubistControl = cubistControlLight
   cubistGrid = tuneGridLight
}

(cubistTune = caret::train( x = sdata_trainfullX_pp, 
                          y = sdata_trainfull_pp$PH , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )

# Make predictions on the provided dataset
final_pred_data <- as.numeric( predict( cubistTune, newdata=edata_test_pp ))

# Output prediction performance
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,1]
full_cubist_train_r2   <- getTrainPerf(cubistTune)[1,2]
# Output predictions to Excel file
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
raw_StudentEvaluation$PH <- final_pred_data
write_xlsx(raw_StudentEvaluation, "ANg_PTanofsky_Ph_Predictions.xlsx")

Technical Report: Determinants of PH in Beverage Process

Alexander Ng, Philip Tanofsky

Due 12/13/2021

Introduction

Data Wrangling Strategy

Rows with Missing Data

Evaluation of Missing Data in the Test Set

Exploratory Data Analysis

Scatterplot of Predictor Variables

Density Plots

Normality Test

Distribution by Brand Code

Boxplot of PH by Brand Code

VSURF

Data Partition into Training and Test Sets

Drop Near Zero Variance Predictors

Outliers and Zeros

Handle Special Cases

KNN Imputation

Data visualization

Exporting the Post-Processed Data Sets

Importing the Pre-Processed Data

Model Building

Gradient Boosted Trees

Cubist

CART

MARS

Linear Regression

Model Selection

Discussion

Final Predictions

Code