This technical report tunes multiple models for prediction of beverage PH and identifies determinants of PH among predictors in the manufacturing process. We train and test 5 models for this purpose including the Cubist, Gradient Boosted Tree, CART, MARS and linear regression models.
The training and test performance are evaluated separately on the StudentData dataset with the metrics RMSE and \(R^2\) to determine the model algorithm most appropriate for predicting the PH of the StudentEvaluation dataset.
We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:
id
to allow unique identification of all raw observations.PH
.PH
value into a 80/20 train and validate data set. We recognize that a test data set exists but this is for external assessment of the project prediction accuracy.BrandCode
categorical value with a dummy BrandCode E
MFR
missing value with the training set median.Along the way, we will assess the numerical impact of such changes to the dataset.
## [1] 2571 33
Let’s evaluate the data characteristics of the raw file after loading. We use the skimr
library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
MFR | 212 | 0.918 | 704.049 | 73.898 | 31.400 | 706.300 | 724.000 | 731.000 | 868.600 | ▁▁▁▂▇ |
Filler Speed | 57 | 0.978 | 3687.199 | 770.820 | 998.000 | 3888.000 | 3982.000 | 3998.000 | 4030.000 | ▁▁▁▁▇ |
PC Volume | 39 | 0.985 | 0.277 | 0.061 | 0.079 | 0.239 | 0.271 | 0.312 | 0.478 | ▁▃▇▂▁ |
PSC CO2 | 39 | 0.985 | 0.056 | 0.043 | 0.000 | 0.020 | 0.040 | 0.080 | 0.240 | ▇▅▂▁▁ |
Fill Ounces | 38 | 0.985 | 23.975 | 0.088 | 23.633 | 23.920 | 23.973 | 24.027 | 24.320 | ▁▂▇▂▁ |
PSC | 33 | 0.987 | 0.085 | 0.049 | 0.002 | 0.048 | 0.076 | 0.112 | 0.270 | ▆▇▃▁▁ |
Carb Pressure1 | 32 | 0.988 | 122.586 | 4.743 | 105.600 | 119.000 | 123.200 | 125.400 | 140.200 | ▁▃▇▂▁ |
Hyd Pressure4 | 30 | 0.988 | 96.289 | 13.123 | 52.000 | 86.000 | 96.000 | 102.000 | 142.000 | ▁▃▇▂▁ |
Carb Pressure | 27 | 0.989 | 68.190 | 3.538 | 57.000 | 65.600 | 68.200 | 70.600 | 79.400 | ▁▅▇▃▁ |
Carb Temp | 26 | 0.990 | 141.095 | 4.037 | 128.600 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▅▇▃▁ |
PSC Fill | 23 | 0.991 | 0.195 | 0.118 | 0.000 | 0.100 | 0.180 | 0.260 | 0.620 | ▆▇▃▁▁ |
Fill Pressure | 22 | 0.991 | 47.922 | 3.178 | 34.600 | 46.000 | 46.400 | 50.000 | 60.400 | ▁▁▇▂▁ |
Filler Level | 20 | 0.992 | 109.252 | 15.698 | 55.800 | 98.300 | 118.400 | 120.000 | 161.200 | ▁▃▅▇▁ |
Hyd Pressure2 | 15 | 0.994 | 20.961 | 16.386 | 0.000 | 0.000 | 28.600 | 34.600 | 59.400 | ▇▂▇▅▁ |
Hyd Pressure3 | 15 | 0.994 | 20.458 | 15.976 | -1.200 | 0.000 | 27.600 | 33.400 | 50.000 | ▇▁▃▇▁ |
Temperature | 14 | 0.995 | 65.968 | 1.383 | 63.600 | 65.200 | 65.600 | 66.400 | 76.200 | ▇▃▁▁▁ |
Oxygen Filler | 12 | 0.995 | 0.047 | 0.047 | 0.002 | 0.022 | 0.033 | 0.060 | 0.400 | ▇▁▁▁▁ |
Pressure Setpoint | 12 | 0.995 | 47.615 | 2.039 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
Hyd Pressure1 | 11 | 0.996 | 12.438 | 12.433 | -0.800 | 0.000 | 11.400 | 20.200 | 58.000 | ▇▅▂▁▁ |
Carb Volume | 10 | 0.996 | 5.370 | 0.106 | 5.040 | 5.293 | 5.347 | 5.453 | 5.700 | ▁▆▇▅▁ |
Carb Rel | 10 | 0.996 | 5.437 | 0.129 | 4.960 | 5.340 | 5.400 | 5.540 | 6.060 | ▁▇▇▂▁ |
Alch Rel | 9 | 0.996 | 6.897 | 0.505 | 5.280 | 6.540 | 6.560 | 7.240 | 8.620 | ▁▇▂▃▁ |
Usage cont | 5 | 0.998 | 20.993 | 2.978 | 12.080 | 18.360 | 21.790 | 23.755 | 25.900 | ▁▃▅▃▇ |
PH | 4 | 0.998 | 8.546 | 0.173 | 7.880 | 8.440 | 8.540 | 8.680 | 9.360 | ▁▅▇▂▁ |
Mnf Flow | 2 | 0.999 | 24.569 | 119.481 | -100.200 | -100.000 | 65.200 | 140.800 | 229.400 | ▇▁▁▇▂ |
Carb Flow | 2 | 0.999 | 2468.354 | 1073.696 | 26.000 | 1144.000 | 3028.000 | 3186.000 | 5104.000 | ▂▅▆▇▁ |
Bowl Setpoint | 2 | 0.999 | 109.327 | 15.303 | 70.000 | 100.000 | 120.000 | 120.000 | 140.000 | ▁▂▃▇▁ |
Density | 1 | 1.000 | 1.174 | 0.378 | 0.240 | 0.900 | 0.980 | 1.620 | 1.920 | ▁▅▇▂▆ |
Balling | 1 | 1.000 | 2.198 | 0.931 | -0.170 | 1.496 | 1.648 | 3.292 | 4.012 | ▁▇▇▁▇ |
Balling Lvl | 1 | 1.000 | 2.050 | 0.870 | 0.000 | 1.380 | 1.480 | 3.140 | 3.660 | ▁▇▂▁▆ |
Pressure Vacuum | 0 | 1.000 | -5.216 | 0.570 | -6.600 | -5.600 | -5.400 | -5.000 | -3.600 | ▂▇▆▂▁ |
Air Pressurer | 0 | 1.000 | 142.834 | 1.212 | 140.800 | 142.200 | 142.600 | 143.000 | 148.200 | ▅▇▁▁▁ |
Next, we display the one character variable Brand
below and its skimr
output below.
Character Variables in Beverage Data
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand Code | 120 | 0.953 | 1 | 1 | 0 | 4 | 0 |
After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH
.
General observations about the skimr
results:
MFR
is missing the most data: about 8% (212) of its observations are missing. The percentage of missing observations is too high for automated imputation in our opinion.Filler Speed
is missing the next most data: about 2.2% (57) of its observations are missing. The percentage is low enough for us to consider imputation.MFR
, Brand
, all of the other predictors have data completeness rates over 97.8%. Therefore, except for Brand
and MFR
, an automated data imputation policy can be considered for those columns.PH
has 4 missing observations. We will decide to drop those 4 observations as we do not wish to impute the response.Our first transformation below is to add a unique identifier id
based on row number to obtain a primary key.
Then we drop the 4 observations which are missing PH
values, rearrange the response variable PH
and BrandCode
for our convenience. Lastly, we also eliminate spaces from the predictor column names to avoid using quote in our code.
%>%
raw_StudentData rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)
## [1] 2567 34
Is there any observations with a high number of missing observations?
# include any rows where any column is NA. Always include id column
%>% dplyr::filter_all( any_vars(is.na(.))) -> x1
sdata_v1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
%>% select_if( ~ any(is.na(.) ) ) -> x2
x1
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
%>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
x2
dim(missing_train_rows)
## [1] 529 28
We find there are 529 observations with NA observations. So 20.6% have NA values. We conclude that the NA values are sufficiently well distributed that removal of all observations with NA values is impractical.
In the marginal table below, we investigate if certain observations have a high incidence of NA values amongst their predictors. The answer shows that the incident of multiple
Number of Rows with NA
num_na | count |
---|---|
1 | 345 |
2 | 119 |
3 | 45 |
4 | 13 |
5 | 3 |
6 | 2 |
7 | 1 |
8 | 1 |
## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"
Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio suggesting independence among the occurrence of missing values.
## [1] 267 33
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
MFR | 31 | 0.884 | 697.801 | 96.400 | 15.600 | 707.000 | 724.600 | 731.450 | 784.800 | ▁▁▁▁▇ |
Filler Speed | 10 | 0.963 | 3581.385 | 911.187 | 1006.000 | 3812.000 | 3978.000 | 3996.000 | 4020.000 | ▁▁▁▁▇ |
Fill Ounces | 6 | 0.978 | 23.970 | 0.076 | 23.747 | 23.920 | 23.967 | 24.013 | 24.200 | ▁▅▇▃▁ |
PSC | 5 | 0.981 | 0.085 | 0.053 | 0.004 | 0.044 | 0.076 | 0.112 | 0.246 | ▆▇▃▂▁ |
PSC CO2 | 5 | 0.981 | 0.051 | 0.038 | 0.000 | 0.020 | 0.040 | 0.060 | 0.240 | ▇▃▂▁▁ |
PC Volume | 4 | 0.985 | 0.278 | 0.063 | 0.099 | 0.233 | 0.275 | 0.322 | 0.464 | ▁▆▇▅▁ |
Carb Pressure1 | 4 | 0.985 | 123.040 | 4.420 | 113.000 | 120.200 | 123.400 | 125.500 | 136.000 | ▃▃▇▂▁ |
Hyd Pressure4 | 4 | 0.985 | 97.840 | 13.921 | 68.000 | 90.000 | 98.000 | 104.000 | 140.000 | ▅▆▇▂▁ |
PSC Fill | 3 | 0.989 | 0.190 | 0.113 | 0.020 | 0.100 | 0.180 | 0.260 | 0.620 | ▇▇▃▁▁ |
Oxygen Filler | 3 | 0.989 | 0.047 | 0.050 | 0.002 | 0.020 | 0.034 | 0.054 | 0.398 | ▇▁▁▁▁ |
Alch Rel | 3 | 0.989 | 6.907 | 0.498 | 6.400 | 6.540 | 6.580 | 7.180 | 7.820 | ▇▁▂▁▃ |
Fill Pressure | 2 | 0.993 | 48.141 | 3.440 | 37.800 | 46.000 | 47.800 | 50.200 | 60.200 | ▁▇▇▂▁ |
Filler Level | 2 | 0.993 | 110.292 | 15.502 | 69.200 | 100.600 | 118.600 | 120.200 | 153.200 | ▂▃▇▇▁ |
Temperature | 2 | 0.993 | 66.227 | 1.692 | 63.800 | 65.400 | 65.800 | 66.600 | 75.400 | ▇▅▁▁▁ |
Usage cont | 2 | 0.993 | 20.897 | 2.999 | 12.900 | 18.120 | 21.440 | 23.740 | 24.600 | ▁▃▃▃▇ |
Pressure Setpoint | 2 | 0.993 | 47.733 | 2.057 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
Carb Rel | 2 | 0.993 | 5.440 | 0.127 | 5.180 | 5.340 | 5.400 | 5.560 | 5.740 | ▂▇▂▃▂ |
Carb Volume | 1 | 0.996 | 5.369 | 0.111 | 5.147 | 5.287 | 5.340 | 5.465 | 5.667 | ▂▇▃▅▁ |
Carb Temp | 1 | 0.996 | 141.226 | 4.296 | 130.000 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▆▇▃▁ |
Hyd Pressure2 | 1 | 0.996 | 20.113 | 17.214 | -50.000 | 0.000 | 26.800 | 34.800 | 61.400 | ▁▁▆▇▁ |
Hyd Pressure3 | 1 | 0.996 | 19.609 | 16.557 | -50.000 | 0.000 | 27.700 | 33.000 | 49.200 | ▁▁▆▃▇ |
Density | 1 | 0.996 | 1.177 | 0.382 | 0.060 | 0.920 | 0.980 | 1.600 | 1.840 | ▁▁▇▁▅ |
Balling | 1 | 0.996 | 2.203 | 0.925 | 0.902 | 1.498 | 1.648 | 3.242 | 3.788 | ▅▇▁▂▅ |
Pressure Vacuum | 1 | 0.996 | -5.174 | 0.582 | -6.400 | -5.600 | -5.200 | -4.800 | -3.600 | ▁▇▆▃▁ |
Bowl Setpoint | 1 | 0.996 | 109.624 | 15.017 | 70.000 | 100.000 | 120.000 | 120.000 | 130.000 | ▁▂▁▃▇ |
Air Pressurer | 1 | 0.996 | 142.831 | 1.231 | 141.200 | 142.200 | 142.600 | 142.800 | 147.200 | ▅▇▁▁▁ |
Carb Pressure | 0 | 1.000 | 68.255 | 3.857 | 60.200 | 65.300 | 68.000 | 70.600 | 77.600 | ▃▆▇▃▂ |
Mnf Flow | 0 | 1.000 | 21.034 | 117.756 | -100.200 | -100.000 | 0.200 | 141.300 | 220.400 | ▇▁▁▆▂ |
Hyd Pressure1 | 0 | 1.000 | 12.010 | 13.531 | -50.000 | 0.000 | 10.400 | 20.400 | 50.000 | ▁▁▇▆▂ |
Carb Flow | 0 | 1.000 | 2408.644 | 1161.364 | 0.000 | 1083.000 | 3038.000 | 3215.000 | 3858.000 | ▂▃▁▆▇ |
Balling Lvl | 0 | 1.000 | 2.051 | 0.883 | 0.000 | 1.380 | 1.480 | 3.080 | 3.420 | ▁▃▇▁▇ |
Character Variables in Beverage Data
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Brand Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
## [1] 267 34
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
%>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
edata_v1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
%>% select_if( ~ any(is.na(.) ) ) -> ex2
ex1
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
%>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
ex2
dim(missing_test_rows)
## [1] 67 28
Number of Rows with NA
num_na | count |
---|---|
1 | 50 |
2 | 11 |
3 | 3 |
4 | 1 |
8 | 1 |
14 | 1 |
## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"
We conclude that the distribution of missing values in the test set is:
The scatterplots of the predictor variables against the dependent variable PH
indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.
The density plots of each predictor variable show many predictor variables with non-normal distributions.
The Shapiro-Wilk test of normality is performed on all the numeric predictor variables and the resulting list confirms no p-value above 0.05 and thus no predictor variable follows a normal distribution per evaluation.
## [1] 0
The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.
The boxplot of PH values by Brand Code certainly captures overlap of the middle two quartiles, which seems reasonable given the PH range is rather narrow. Of note, the median values of the 4 labeled brands and the unlabeled group do appear distinct besides Brand A and the unlabeled.
To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.
## Thresholding step
## Estimated computational time (on one core): 12.3 sec.
##
|
| | 0%
|
|==== | 5%
|
|======= | 10%
|
|========== | 15%
|
|============== | 20%
|
|================== | 25%
|
|===================== | 30%
|
|======================== | 35%
|
|============================ | 40%
|
|================================ | 45%
|
|=================================== | 50%
|
|====================================== | 55%
|
|========================================== | 60%
|
|============================================== | 65%
|
|================================================= | 70%
|
|==================================================== | 75%
|
|======================================================== | 80%
|
|============================================================ | 85%
|
|=============================================================== | 90%
|
|================================================================== | 95%
|
|======================================================================| 100%
## Interpretation step (on 26 variables)
## Estimated computational time (on one core): between 799.5 sec. and 2356.1 sec.
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|=========== | 15%
|
|============= | 19%
|
|================ | 23%
|
|=================== | 27%
|
|====================== | 31%
|
|======================== | 35%
|
|=========================== | 38%
|
|============================== | 42%
|
|================================ | 46%
|
|=================================== | 50%
|
|====================================== | 54%
|
|======================================== | 58%
|
|=========================================== | 62%
|
|============================================== | 65%
|
|================================================ | 69%
|
|=================================================== | 73%
|
|====================================================== | 77%
|
|========================================================= | 81%
|
|=========================================================== | 85%
|
|============================================================== | 88%
|
|================================================================= | 92%
|
|=================================================================== | 96%
|
|======================================================================| 100%
## Prediction step (on 17 variables)
## Maximum estimated computational time (on one core): 970.2 sec.
##
|
| | 0%
|
|==== | 6%
|
|======== | 12%
|
|============ | 18%
|
|================ | 24%
|
|===================== | 29%
|
|========================= | 35%
|
|============================= | 41%
|
|================================= | 47%
|
|===================================== | 53%
|
|========================================= | 59%
|
|============================================= | 65%
|
|================================================= | 71%
|
|====================================================== | 76%
|
|========================================================== | 82%
|
|============================================================== | 88%
|
|================================================================== | 94%
|
|======================================================================| 100%
## ** VSURF results **
## The results object is a list of length around 20.
## Most interesting components are the following:
##
## name description
## 1 "$varselect.thres" "variables selected after thesholding step"
## 2 "$varselect.interp" "variables selected after interpretation step"
## 3 "$varselect.pred" "variables selected after prediction step"
## 4 "$ord.imp$x" "mean VI in decreasing order for all variables"
## 5 "$ord.imp$ix" "indices of the ordering of all variables VI mean"
## 6 "mean.perf" "mean OOB rate for RF build on all variables"
##
## For more information about VSURF outputs see the VSURF help page
##
## VSURF computation time: 36.2 mins
##
## VSURF selected:
## 26 variables at thresholding step (in 7.4 secs)
## 17 variables at interpretation step (in 23.4 mins)
## 16 variables at prediction step (in 12.6 mins)
## [1] 9 19 30 16 29 26 31 20 18 23 25 24 27 17 12 28
Next we partition the student evaluation data set by training and test sets. We will retain the terminology of test set to refer to the validation data set used for model performance assessment outside of the cross validation process.
set.seed(19372)
= createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
train_indices
= sdata_v1[-train_indices , ]
sdata_test = sdata_v1[ train_indices , ]
sdata_train = sdata_v1$PH[-train_indices ]
sdataY_test = sdata_v1$PH[ train_indices ] sdataY_train
We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.
No Near Zero Variance Predictors in Training
freqRatio | percentUnique | zeroVar | nzv | |
---|---|---|---|---|
id | 1.00 | 100.00 | FALSE | FALSE |
PH | 1.05 | 2.48 | FALSE | FALSE |
CarbVolume | 1.03 | 4.72 | FALSE | FALSE |
FillOunces | 1.18 | 4.38 | FALSE | FALSE |
PCVolume | 1.19 | 21.22 | FALSE | FALSE |
CarbPressure | 1.10 | 5.06 | FALSE | FALSE |
CarbTemp | 1.04 | 5.99 | FALSE | FALSE |
PSC | 1.14 | 6.23 | FALSE | FALSE |
PSCFill | 1.11 | 1.56 | FALSE | FALSE |
PSCCO2 | 1.08 | 0.63 | FALSE | FALSE |
MnfFlow | 1.08 | 21.75 | FALSE | FALSE |
CarbPressure1 | 1.02 | 6.37 | FALSE | FALSE |
FillPressure | 1.78 | 5.06 | FALSE | FALSE |
HydPressure1 | 31.86 | 11.58 | FALSE | FALSE |
HydPressure2 | 7.63 | 9.39 | FALSE | FALSE |
HydPressure3 | 12.71 | 8.91 | FALSE | FALSE |
HydPressure4 | 1.02 | 1.90 | FALSE | FALSE |
FillerLevel | 1.05 | 12.90 | FALSE | FALSE |
FillerSpeed | 1.00 | 10.41 | FALSE | FALSE |
Temperature | 1.05 | 2.63 | FALSE | FALSE |
Usagecont | 1.12 | 22.43 | FALSE | FALSE |
CarbFlow | 1.33 | 23.21 | FALSE | FALSE |
Density | 1.19 | 3.65 | FALSE | FALSE |
MFR | 1.04 | 25.94 | FALSE | FALSE |
Balling | 1.24 | 9.34 | FALSE | FALSE |
PressureVacuum | 1.43 | 0.78 | FALSE | FALSE |
OxygenFiller | 1.32 | 15.67 | FALSE | FALSE |
BowlSetpoint | 2.80 | 0.49 | FALSE | FALSE |
PressureSetpoint | 1.33 | 0.39 | FALSE | FALSE |
AirPressurer | 1.03 | 1.41 | FALSE | FALSE |
AlchRel | 1.10 | 2.53 | FALSE | FALSE |
CarbRel | 1.02 | 1.95 | FALSE | FALSE |
BallingLvl | 1.23 | 3.89 | FALSE | FALSE |
BrandCode | 2.01 | 0.19 | FALSE | FALSE |
Next we identify and fix zeros and outliers in the predictors before imputing missing values.
The reason is that KNN algorithm may be affected by outliers. The table below uses skim
to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tells us how many standard deviations is the minimum and maximum.
\[ \text{zscore_min}(X_i) = \frac{ min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore_max}(X_i) = \frac{ max(X_i) - \mu(X_i) }{\sigma(X_i )}\]
Predictors with Outliers by Z-Score
skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.hist | zscore_min | zscore_max |
---|---|---|---|---|---|---|---|
MFR | 156 | 0.92 | 704.15 | 74.72 | ▁▁▁▂▇ | -9.00 | 2.20 |
OxygenFiller | 7 | 1.00 | 0.05 | 0.04 | ▇▁▁▁▁ | -0.97 | 7.88 |
Temperature | 8 | 1.00 | 65.95 | 1.36 | ▇▃▁▁▁ | -1.73 | 7.55 |
CarbRel | 6 | 1.00 | 5.44 | 0.13 | ▁▇▇▂▁ | -3.70 | 4.84 |
AirPressurer | 0 | 1.00 | 142.83 | 1.20 | ▇▇▁▁▁ | -1.52 | 4.48 |
PSCCO2 | 29 | 0.99 | 0.06 | 0.04 | ▇▅▂▁▁ | -1.31 | 4.31 |
FillPressure | 15 | 0.99 | 47.91 | 3.14 | ▁▁▇▅▁ | -4.23 | 3.72 |
Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors exist in the distribution of the predictors with outlier values.
Isolated points exist but lie within a reasonable proximity to neighboring points in the scatter plot. Therefore no data corrections for outliers will be applied to the training data.
The code below will transform the special case situations of data pre-processing identified earlier:
E
.These rules are applied consistently to be the training, validation and test data sets prior to subsequent pre processing.
We apply the caret pre processing function to that training, validation and test data sets.
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
= preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
preProcFunc
# Becomes the source data for the model building
= predict( preProcFunc, as.data.frame(sdata_v2_train ) )
sdata_train_pp
# Becomes the final version of test data for validation
= predict( preProcFunc, as.data.frame(sdata_v2_test) )
sdata_test_pp
# Need to generate predictions based on test data without known response
= predict( preProcFunc, as.data.frame(edata_v2) )
edata_test_pp
= sdata_train_pp %>% select( -PH)
sdata_trainX_pp = sdata_test_pp %>% select(-PH)
sdata_testX_pp = edata_test_pp %>% select( -PH) edata_testX_pp
Name | sdata_train_pp |
Number of rows | 2055 |
Number of columns | 34 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 33 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 1297.57 | 751.06 | 1.00 | 644.50 | 1296.00 | 1961.00 | 2568.00 | ▇▇▇▇▇ |
PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▁▁ |
CarbVolume | 0 | 1 | 0.00 | 1.00 | -3.42 | -0.71 | -0.26 | 0.86 | 2.90 | ▁▃▇▅▁ |
FillOunces | 0 | 1 | 0.00 | 0.99 | -3.88 | -0.62 | -0.01 | 0.60 | 3.91 | ▁▂▇▂▁ |
PCVolume | 0 | 1 | 0.01 | 1.00 | -3.72 | -0.60 | -0.07 | 0.59 | 3.01 | ▁▂▇▃▁ |
CarbPressure | 0 | 1 | 0.01 | 1.00 | -3.11 | -0.72 | 0.04 | 0.70 | 2.88 | ▁▃▇▅▁ |
CarbTemp | 0 | 1 | 0.01 | 1.00 | -3.47 | -0.65 | 0.01 | 0.69 | 2.90 | ▁▂▇▅▁ |
PSC | 0 | 1 | 0.00 | 0.99 | -2.75 | -0.65 | 0.01 | 0.66 | 2.83 | ▁▅▇▅▁ |
PSCFill | 0 | 1 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.54 | 3.60 | ▆▇▃▁▁ |
PSCCO2 | 0 | 1 | 0.00 | 0.99 | -1.31 | -0.84 | -0.37 | 0.33 | 4.31 | ▇▅▂▁▁ |
MnfFlow | 0 | 1 | 0.00 | 1.00 | -1.05 | -1.05 | 0.50 | 0.97 | 1.58 | ▇▁▁▆▂ |
CarbPressure1 | 0 | 1 | -0.01 | 1.00 | -3.63 | -0.77 | 0.13 | 0.60 | 3.77 | ▁▃▇▂▁ |
FillPressure | 0 | 1 | 0.00 | 1.00 | -5.38 | -0.58 | -0.45 | 0.70 | 3.23 | ▁▁▇▆▁ |
HydPressure1 | 0 | 1 | 0.00 | 1.00 | -1.07 | -1.01 | -0.08 | 0.65 | 3.66 | ▇▅▂▁▁ |
HydPressure2 | 0 | 1 | 0.00 | 1.00 | -1.29 | -1.29 | 0.47 | 0.83 | 2.34 | ▇▂▇▅▁ |
HydPressure3 | 0 | 1 | 0.00 | 1.00 | -1.37 | -1.29 | 0.44 | 0.80 | 1.84 | ▇▁▃▇▁ |
HydPressure4 | 0 | 1 | 0.01 | 1.00 | -3.40 | -0.76 | 0.07 | 0.52 | 2.82 | ▁▅▇▇▁ |
FillerLevel | 0 | 1 | 0.00 | 1.00 | -2.85 | -0.72 | 0.54 | 0.68 | 4.30 | ▁▅▇▁▁ |
FillerSpeed | 0 | 1 | -0.05 | 1.05 | -3.44 | 0.20 | 0.41 | 0.44 | 0.51 | ▁▁▁▁▇ |
Temperature | 0 | 1 | 0.00 | 1.00 | -1.91 | -0.57 | -0.25 | 0.38 | 6.50 | ▇▆▁▁▁ |
Usagecont | 0 | 1 | 0.00 | 1.00 | -2.46 | -0.92 | 0.20 | 0.95 | 1.83 | ▁▇▅▇▆ |
CarbFlow | 0 | 1 | 0.00 | 1.00 | -2.02 | -1.33 | 0.52 | 0.69 | 2.99 | ▃▁▇▁▁ |
Density | 0 | 1 | 0.00 | 1.00 | -4.85 | -0.68 | -0.41 | 1.18 | 1.71 | ▁▁▁▇▅ |
MFR | 0 | 1 | 0.00 | 1.00 | -6.70 | -0.02 | 0.29 | 0.41 | 3.35 | ▁▁▁▇▁ |
Balling | 0 | 1 | 0.00 | 1.00 | -6.22 | -0.74 | -0.50 | 1.19 | 1.68 | ▁▁▁▇▅ |
PressureVacuum | 0 | 1 | 0.00 | 1.00 | -2.42 | -0.67 | -0.31 | 0.39 | 2.84 | ▂▇▅▃▂ |
OxygenFiller | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.35 | 0.03 | 0.62 | 3.21 | ▃▇▆▂▁ |
BowlSetpoint | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.73 | 0.70 | 0.70 | 2.40 | ▁▃▃▇▁ |
PressureSetpoint | 0 | 1 | 0.00 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
AirPressurer | 0 | 1 | 0.00 | 1.00 | -1.58 | -0.53 | -0.18 | 0.16 | 4.37 | ▅▇▁▁▁ |
AlchRel | 0 | 1 | 0.00 | 1.00 | -5.09 | -0.72 | -0.62 | 0.79 | 2.76 | ▁▁▇▂▃ |
CarbRel | 0 | 1 | 0.00 | 1.00 | -4.28 | -0.75 | -0.26 | 0.97 | 4.21 | ▁▂▇▅▁ |
BallingLvl | 0 | 1 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.26 | 1.86 | ▁▇▂▁▆ |
Name | sdata_test_pp |
Number of rows | 512 |
Number of columns | 34 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 33 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 1237.47 | 705.94 | 5.00 | 638.00 | 1248.50 | 1803.75 | 2571.00 | ▇▇▇▇▆ |
PH | 0 | 1 | 8.55 | 0.17 | 7.98 | 8.44 | 8.54 | 8.68 | 8.94 | ▁▃▇▇▃ |
CarbVolume | 0 | 1 | 0.01 | 1.01 | -3.12 | -0.71 | -0.19 | 0.80 | 2.52 | ▁▃▇▅▂ |
FillOunces | 0 | 1 | 0.03 | 1.01 | -3.20 | -0.55 | -0.01 | 0.60 | 3.99 | ▁▅▇▂▁ |
PCVolume | 0 | 1 | -0.01 | 0.93 | -3.47 | -0.55 | -0.04 | 0.58 | 2.79 | ▁▂▇▅▁ |
CarbPressure | 0 | 1 | 0.00 | 0.99 | -3.54 | -0.72 | -0.02 | 0.71 | 2.74 | ▁▂▇▇▁ |
CarbTemp | 0 | 1 | -0.02 | 1.00 | -3.47 | -0.65 | -0.09 | 0.64 | 2.90 | ▁▂▇▅▁ |
PSC | 0 | 1 | -0.05 | 1.06 | -2.53 | -0.81 | -0.08 | 0.67 | 2.76 | ▂▇▇▅▂ |
PSCFill | 0 | 1 | -0.02 | 0.99 | -1.66 | -0.81 | -0.13 | 0.54 | 3.43 | ▆▇▃▂▁ |
PSCCO2 | 0 | 1 | 0.07 | 1.03 | -1.31 | -0.84 | -0.37 | 0.56 | 4.31 | ▇▅▂▁▁ |
MnfFlow | 0 | 1 | -0.06 | 0.98 | -1.05 | -1.05 | -0.22 | 0.92 | 1.69 | ▇▁▁▇▁ |
CarbPressure1 | 0 | 1 | -0.02 | 1.06 | -3.46 | -0.90 | 0.09 | 0.65 | 3.34 | ▁▅▇▅▁ |
FillPressure | 0 | 1 | 0.02 | 1.04 | -3.82 | -0.58 | -0.45 | 0.70 | 3.41 | ▁▁▇▅▁ |
HydPressure1 | 0 | 1 | -0.03 | 1.01 | -1.07 | -1.01 | -0.12 | 0.57 | 3.21 | ▇▅▂▁▁ |
HydPressure2 | 0 | 1 | -0.04 | 1.00 | -1.29 | -1.29 | 0.42 | 0.81 | 2.08 | ▇▁▆▆▁ |
HydPressure3 | 0 | 1 | -0.05 | 1.00 | -1.37 | -1.29 | 0.40 | 0.79 | 1.81 | ▇▁▃▇▁ |
HydPressure4 | 0 | 1 | 0.07 | 0.99 | -2.16 | -0.58 | 0.23 | 0.52 | 2.82 | ▂▃▇▂▁ |
FillerLevel | 0 | 1 | -0.07 | 1.05 | -2.48 | -1.22 | 0.57 | 0.69 | 3.08 | ▂▃▇▁▁ |
FillerSpeed | 0 | 1 | -0.16 | 1.18 | -3.43 | 0.05 | 0.35 | 0.44 | 0.48 | ▁▁▁▁▇ |
Temperature | 0 | 1 | 0.06 | 1.05 | -1.91 | -0.57 | -0.10 | 0.38 | 6.30 | ▆▇▁▁▁ |
Usagecont | 0 | 1 | -0.05 | 1.01 | -2.54 | -1.04 | 0.19 | 0.94 | 1.78 | ▁▇▅▇▇ |
CarbFlow | 0 | 1 | 0.08 | 1.00 | -2.02 | -1.25 | 0.54 | 0.74 | 1.43 | ▃▁▁▇▅ |
Density | 0 | 1 | 0.02 | 1.00 | -4.14 | -0.68 | -0.41 | 1.18 | 1.68 | ▁▁▆▇▇ |
MFR | 0 | 1 | -0.01 | 0.97 | -6.40 | -0.04 | 0.29 | 0.39 | 2.91 | ▁▁▁▇▁ |
Balling | 0 | 1 | -0.01 | 1.01 | -3.70 | -0.82 | -0.50 | 1.19 | 1.63 | ▁▁▇▂▆ |
PressureVacuum | 0 | 1 | 0.04 | 1.00 | -2.42 | -0.67 | 0.04 | 0.39 | 2.84 | ▂▇▆▃▂ |
OxygenFiller | 0 | 1 | 0.09 | 0.98 | -1.98 | -0.27 | 0.08 | 0.69 | 3.21 | ▃▇▇▂▁ |
BowlSetpoint | 0 | 1 | -0.09 | 1.08 | -2.39 | -1.35 | 0.70 | 0.70 | 2.40 | ▂▃▂▇▁ |
PressureSetpoint | 0 | 1 | 0.02 | 0.98 | -1.94 | -0.77 | -0.77 | 1.16 | 1.28 | ▁▇▁▁▆ |
AirPressurer | 0 | 1 | 0.04 | 1.05 | -1.76 | -0.53 | -0.18 | 0.16 | 3.91 | ▂▇▁▁▁ |
AlchRel | 0 | 1 | 0.00 | 0.99 | -1.08 | -0.72 | -0.67 | 0.68 | 2.74 | ▇▁▂▃▁ |
CarbRel | 0 | 1 | -0.01 | 1.00 | -3.67 | -0.75 | -0.26 | 0.83 | 3.02 | ▁▂▇▅▁ |
BallingLvl | 0 | 1 | 0.01 | 1.01 | -2.36 | -0.77 | -0.68 | 1.26 | 1.65 | ▁▆▇▁▇ |
Name | edata_test_pp |
Number of rows | 267 |
Number of columns | 34 |
_______________________ | |
Column type frequency: | |
character | 1 |
logical | 1 |
numeric | 32 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
PH | 267 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 134.00 | 77.22 | 1.00 | 67.50 | 134.00 | 200.50 | 267.00 | ▇▇▇▇▇ |
CarbVolume | 0 | 1 | -0.02 | 1.04 | -2.23 | -0.81 | -0.26 | 0.89 | 2.63 | ▂▇▅▅▁ |
FillOunces | 0 | 1 | -0.04 | 0.86 | -2.60 | -0.62 | -0.09 | 0.45 | 2.60 | ▁▅▇▃▁ |
PCVolume | 0 | 1 | 0.01 | 1.04 | -3.28 | -0.70 | 0.01 | 0.75 | 2.82 | ▁▃▇▇▁ |
CarbPressure | 0 | 1 | 0.01 | 1.08 | -2.42 | -0.80 | -0.02 | 0.70 | 2.46 | ▂▆▇▅▂ |
CarbTemp | 0 | 1 | 0.03 | 1.06 | -3.04 | -0.65 | -0.04 | 0.71 | 2.90 | ▁▃▇▃▁ |
PSC | 0 | 1 | -0.01 | 1.08 | -2.53 | -0.75 | 0.01 | 0.66 | 2.55 | ▂▅▇▃▂ |
PSCFill | 0 | 1 | -0.05 | 0.96 | -1.49 | -0.73 | -0.13 | 0.46 | 3.60 | ▆▇▃▁▁ |
PSCCO2 | 0 | 1 | -0.12 | 0.89 | -1.31 | -0.84 | -0.37 | 0.09 | 4.31 | ▇▃▂▁▁ |
MnfFlow | 0 | 1 | -0.04 | 0.98 | -1.05 | -1.05 | -0.22 | 0.96 | 1.62 | ▇▁▁▆▂ |
CarbPressure1 | 0 | 1 | 0.09 | 0.94 | -2.05 | -0.53 | 0.17 | 0.60 | 2.87 | ▂▃▇▂▁ |
FillPressure | 0 | 1 | 0.06 | 1.10 | -3.82 | -0.58 | 0.01 | 0.76 | 3.37 | ▁▁▇▆▁ |
HydPressure1 | 0 | 1 | -0.04 | 1.09 | -5.04 | -1.01 | -0.17 | 0.63 | 3.02 | ▁▁▇▆▂ |
HydPressure2 | 0 | 1 | -0.07 | 1.05 | -4.34 | -1.29 | 0.35 | 0.83 | 2.46 | ▁▁▆▇▁ |
HydPressure3 | 0 | 1 | -0.07 | 1.04 | -4.42 | -1.29 | 0.44 | 0.77 | 1.79 | ▁▁▆▃▇ |
HydPressure4 | 0 | 1 | 0.13 | 1.05 | -2.63 | -0.41 | 0.23 | 0.66 | 2.73 | ▁▂▇▃▂ |
FillerLevel | 0 | 1 | 0.06 | 1.01 | -2.32 | -0.66 | 0.56 | 0.69 | 3.51 | ▃▃▇▁▁ |
FillerSpeed | 0 | 1 | -0.23 | 1.25 | -3.43 | 0.04 | 0.28 | 0.44 | 0.48 | ▁▁▁▁▇ |
Temperature | 0 | 1 | 0.21 | 1.20 | -1.74 | -0.40 | -0.09 | 0.53 | 6.09 | ▇▇▂▁▁ |
Usagecont | 0 | 1 | -0.05 | 1.01 | -2.37 | -1.02 | 0.06 | 0.94 | 1.29 | ▁▅▃▂▇ |
CarbFlow | 0 | 1 | -0.03 | 1.07 | -2.02 | -1.36 | 0.53 | 0.73 | 1.46 | ▅▁▁▇▃ |
Density | 0 | 1 | -0.02 | 1.14 | -9.22 | -0.64 | -0.41 | 1.14 | 1.58 | ▁▁▁▃▇ |
MFR | 0 | 1 | -0.05 | 1.16 | -6.71 | -0.01 | 0.29 | 0.41 | 1.50 | ▁▁▁▁▇ |
Balling | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.74 | -0.50 | 1.16 | 1.54 | ▁▇▃▁▇ |
PressureVacuum | 0 | 1 | 0.08 | 1.02 | -2.07 | -0.67 | 0.04 | 0.74 | 2.84 | ▁▇▆▃▁ |
OxygenFiller | 0 | 1 | 0.02 | 0.99 | -1.98 | -0.41 | 0.05 | 0.55 | 3.20 | ▅▇▇▂▁ |
BowlSetpoint | 0 | 1 | 0.00 | 0.99 | -2.39 | -0.73 | 0.70 | 0.70 | 1.52 | ▂▂▃▇▁ |
PressureSetpoint | 0 | 1 | 0.06 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
AirPressurer | 0 | 1 | 0.01 | 1.03 | -1.40 | -0.53 | -0.18 | -0.01 | 3.60 | ▅▇▁▁▁ |
AlchRel | 0 | 1 | 0.01 | 0.98 | -1.08 | -0.72 | -0.62 | 0.68 | 1.74 | ▇▁▁▁▃ |
CarbRel | 0 | 1 | 0.02 | 0.99 | -2.14 | -0.75 | -0.26 | 0.97 | 2.24 | ▁▇▃▃▃ |
BallingLvl | 0 | 1 | 0.00 | 1.02 | -2.36 | -0.77 | -0.66 | 1.19 | 1.58 | ▁▃▇▁▇ |
First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of incomplete cells in the dataset, we prefer this dropping observations with any incomplete columns.
The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot
package. This was the most efficient way to gain insight on the related clusters of variables.
= cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
M
::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 ) corrplot
We compare the correlations to the cleaned pre-processed training data.
= cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
Mpp ::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 ) corrplot
Our last step of preprocessing is to export the data sets. This need only but done once.
::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr
::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr
::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE ) readr
The above data files have the following properties and relationships to the original raw files:
Pre-Processed Data Files
filename | primary_key | response | numeric_columns | num_rows | file_source |
---|---|---|---|---|---|
sdata_train_pp.csv | id - consistent with sdata_test_pp | PH - unchanged | BoxCox, center, scaled, imputed | 2055 | StudentData.xlsx |
sdata_test_pp.csv | id - consistent with sdata_train_pp | PH - unchanged | BoxCox, center, scaled, imputed | 512 | StudentData.xlsx |
edata_test_pp.csv | id - standalone | PH - unchanged | BoxCox, center, scaled, imputed | 267 | StudentEvaluation.xlsx |
An important note about the id
column. The id
in the file edata_test_pp.csv
is the row number of the same observation in StudentEvaluation.xlsx
. But the same is not true for sdata_train_pp.csv
and sdata_test_pp.csv
because they are sampled from the source file StudentData.xlsx
. The 10th row of sdata_test_pp.csv
is not necessarily the 10th row of StudentData.xlsx
rather one should use the id
field as the corresponding row number of the original file.
We load the pre-processed datasets for training and testing in this section. We load the csv file for the training data set below and display the first few column characteristics from the file.
= read_csv("sdata_train_pp.csv",
sdata_train_pp col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
= read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
sdata_test_pp as_tibble() %>% remove_rownames()
= sdata_train_pp %>% select( -PH, -id )
sdata_trainX_pp = sdata_test_pp %>% select(-PH , -id) sdata_testX_pp
= TRUE # Set to TRUE if running the lengthy tuning process for all models tuningFull
We consider the same dataset and variable importance problem using gradient boosted trees. The gbm
library and variable importance measures are run within caret training procedure. The gradient boosting approach relies on a weak learner in an additive manner that minimizes the given loss function.
set.seed(1027)
<- trainControl(method = "repeatedcv",
gbmControlFull number = 10, # for debug mode, set this to a low number (1)
repeats = 10 ,
selectionFunction = "best",
verboseIter = FALSE)
<- trainControl(method = "repeatedcv",
gbmControlLight number = 5, # for debug mode, set this to a low number (1)
repeats = 1 ,
selectionFunction = "best",
verboseIter = FALSE)
if(tuningFull == TRUE)
{= gbmControlFull
gbmControl = expand.grid( n.trees = seq(250, 1000, by = 250),
gbmTune shrinkage = c( 0.02, 0.05, 0.1) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 15
)else {
} = gbmControlLight
gbmControl = expand.grid( n.trees = c( 2000),
gbmTune shrinkage = c( 0.02, 0.05 , 0.1 ) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 4
)
}
gbmTune = caret::train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "gbm",
tuneGrid = gbmTune ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )
## Stochastic Gradient Boosting
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.02 3 250 0.1434366 0.3445322 0.1130661
## 0.02 3 500 0.1399034 0.3781615 0.1093662
## 0.02 3 750 0.1381810 0.3948955 0.1075668
## 0.02 3 1000 0.1371433 0.4053153 0.1065292
## 0.02 7 250 0.1375662 0.3998266 0.1072462
## 0.02 7 500 0.1339808 0.4329093 0.1037283
## 0.02 7 750 0.1326817 0.4453548 0.1025127
## 0.02 7 1000 0.1320709 0.4515327 0.1019375
## 0.05 3 250 0.1391827 0.3846115 0.1085754
## 0.05 3 500 0.1370326 0.4073776 0.1063487
## 0.05 3 750 0.1364019 0.4163811 0.1056864
## 0.05 3 1000 0.1360931 0.4213869 0.1053766
## 0.05 7 250 0.1337931 0.4330732 0.1032574
## 0.05 7 500 0.1324935 0.4472241 0.1018857
## 0.05 7 750 0.1319768 0.4531566 0.1013501
## 0.05 7 1000 0.1317495 0.4562800 0.1010483
## 0.10 3 250 0.1379832 0.3995082 0.1072565
## 0.10 3 500 0.1371777 0.4135910 0.1063966
## 0.10 3 750 0.1368283 0.4201240 0.1059677
## 0.10 3 1000 0.1366484 0.4244918 0.1056588
## 0.10 7 250 0.1343586 0.4341016 0.1034324
## 0.10 7 500 0.1335822 0.4437819 0.1027288
## 0.10 7 750 0.1332925 0.4472134 0.1023291
## 0.10 7 1000 0.1332832 0.4482130 0.1022717
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 7, shrinkage = 0.05 and n.minobsinnode = 15.
As the model output indicates the greater tree depth outperforms the lower counts and the shrinkage value of 0.05 produces the best RMSE value. The shrinkage, or learning rate, typically works better with smaller values, yet require longer processing time.
## Saving 7 x 5 in image
The plot of the tuned model clearly shows the better performance with the increased max tree depth but also indicates the lower shrinkage value doesn’t guarantee a lower RMSE. For the max tree depth of 3 and 7, the lowest shrinkage value does not produce the lowest RMSE for that tree depth.
## Saving 7 x 5 in image
The variable importance plot shows the 100% importance of the MnfFlow
followed by BrandCode
as the only other variable with an importance above 50%. Usagecont
, OxygenFiller
, AlchRel
, and Temperature
round out the top 6 variables based on importance with scores greater than 25%.
Overall, the final Gradient Boosted model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.
Now we train a Cubist tree using the best
selectionFunction on RMSE. The Cubist
library and variable importance measures are run within caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contains linear regression models. The models are based on predictions defined in the previous nodes of the tree.
set.seed(1027)
<- trainControl(method = "cv" , selectionFunction = "best")
cubistControlFull <- expand.grid( committees = c( 10, 50, 100 ) ,
tuneGridFull neighbors = c( 0, 1, 5, 9 )
)
<- trainControl(method = "cv" , selectionFunction = "best")
cubistControlLight <- expand.grid( committees = c( 10, 20 ) ,
tuneGridLight neighbors = c( 0, 5 ) )
if(tuningFull == TRUE)
{= cubistControlFull
cubistControl = tuneGridFull
cubistGrid else {
} = cubistControlLight
cubistControl = tuneGridLight
cubistGrid
}
cubistTune = caret::train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
## Cubist
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.10326126 0.6469501 0.07521436
## 10 1 0.10698108 0.6394174 0.07408043
## 10 5 0.09661241 0.6865169 0.06923965
## 10 9 0.09715807 0.6838488 0.06954861
## 50 0 0.10213076 0.6574062 0.07362497
## 50 1 0.10583349 0.6449193 0.07208654
## 50 5 0.09539318 0.6943587 0.06757449
## 50 9 0.09598916 0.6917813 0.06811918
## 100 0 0.10218813 0.6576706 0.07366847
## 100 1 0.10587327 0.6449503 0.07201655
## 100 5 0.09547621 0.6941947 0.06752479
## 100 9 0.09606335 0.6917983 0.06810842
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.
The Cubist model results provide a strong RMSE value and \(R^2\) value with generally a higher number of committees and neighbors.
The line plot confirms the better model per RMSE with neighbors at 5 and committees set to 50.
## Saving 7 x 5 in image
The variable importance plot identifies MnfFlow
as the most valuable variable followed by Balling
, BallingLvl
, AlchRel
, and PressureVacuum
to round out the top 5 variables. Overall, 7 variables resulted in over 50% importance indicating those variables are contained in over 50% of the rules defining the Cubist model.
## Saving 7 x 5 in image
The final Cubist model confirms the high usage rate of the variable MnfFlow
.
Cubist Predictor Usage
Conditions | Model | Variable |
---|---|---|
82 | 61 | MnfFlow |
18 | 81 | Balling |
22 | 71 | BallingLvl |
19 | 73 | AlchRel |
28 | 60 | PressureVacuum |
19 | 63 | OxygenFiller |
14 | 66 | Density |
21 | 51 | BowlSetpoint |
8 | 61 | HydPressure3 |
7 | 62 | Temperature |
9 | 59 | AirPressurer |
18 | 47 | CarbFlow |
6 | 58 | CarbRel |
3 | 54 | HydPressure2 |
3 | 52 | CarbPressure1 |
11 | 40 | Usagecont |
16 | 31 | FillerSpeed |
6 | 39 | HydPressure1 |
41 | 0 | BrandCode |
3 | 33 | FillerLevel |
0 | 27 | CarbPressure |
3 | 20 | CarbVolume |
2 | 21 | PCVolume |
0 | 20 | CarbTemp |
0 | 20 | PressureSetpoint |
0 | 17 | FillPressure |
0 | 16 | HydPressure4 |
0 | 16 | PSCFill |
0 | 6 | MFR |
0 | 5 | FillOunces |
0 | 4 | PSC |
0 | 3 | PSCCO2 |
The splits
plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow
along with the use of greater than or less than inequalities for MnfFlow
.
## quartz_off_screen
## 2
The dotplot
allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule.
When a predictor has more dots, it is used in more models which suggests the variable is influential. The color density is proportional to the variable importance. The sign of the coefficient predicts the marginal effect of the predictor on the response PH
. However, since Cubist uses many models, the concept of the sign
of a predictor coefficient can be unclear.
The x-axis value of the dot in the dotplot
indicates the true coefficient in the original units unlike the above plot which normalizes the x-axis.
## quartz_off_screen
## 2
The boxplot below which we devised constructs a distribution of coefficients for each predictor. We interpret the sign of the predictor \(X\) marginal effect to be the sign of the median of the distribution of coefficients pf \(X\) over those models which include \(X\). The predictors have been sorted by their median.
The plot provides clarity regarding the impact of a given variable as the median is a good estimate of the sign of the variable coefficient. The AlchRel
variable box indicates a positive relationship with the dependent variable PH
. The most important variable MnfFlow
box results in a negative coefficient, and thus an inverse effect on PH
. For all the rules in the final Cubist model, the understanding of the coefficient values for each variable provides a clear relationship to the dependent variable despite the obtuse nature of the Cubist model construction.
## Saving 7 x 5 in image
Overall, the final Cubist model results in an RMSE under 0.1 and a \(R^2\) just below of 0.7.
Using the rpart
library and its training methods, we consider the best CART based on the RMSE
metric. The CART approach follows a classification and regression tree methodology.
set.seed(10393)
library(rpart)
= expand.grid( maxdepth = seq( 1, 20, by = 1 ) )
cartGrid
<- trainControl(method = "boot",
cartControl number = 30,
selectionFunction = "best")
rpartTune = train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )
## CART
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.1540934 0.2150028 0.1207013
## 2 0.1485972 0.2697600 0.1163125
## 3 0.1439126 0.3150026 0.1130389
## 4 0.1418976 0.3346403 0.1106321
## 5 0.1408007 0.3455111 0.1095038
## 6 0.1398046 0.3557469 0.1085050
## 7 0.1389994 0.3635874 0.1074293
## 8 0.1381955 0.3717156 0.1065335
## 9 0.1373228 0.3802125 0.1057161
## 10 0.1362384 0.3899605 0.1046780
## 11 0.1353910 0.3979319 0.1038336
## 12 0.1345705 0.4055253 0.1028965
## 13 0.1340562 0.4106769 0.1022712
## 14 0.1335638 0.4154690 0.1017284
## 15 0.1330891 0.4202569 0.1012114
## 16 0.1327996 0.4230057 0.1008957
## 17 0.1327006 0.4239743 0.1008136
## 18 0.1326532 0.4245372 0.1007342
## 19 0.1326119 0.4250877 0.1006977
## 20 0.1326119 0.4250877 0.1006977
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.
The model output shows increasing tree depth leads to improved RMSE and \(R^2\) values.
The below plot displays the improving RMSE value with the increasing tree depth with an optimal tree depth of 19.
## Saving 7 x 5 in image
Once again, the variable MnfFlow
is the most important variable with an importance rating of 100%. Six variables achieve an importance score of greater than 50% with BrandCode
, Temperature
, and CarbRel
each achieving greater than 75% importance.
## Saving 7 x 5 in image
The CART tree plot of the final model reaffirms the importance of the variable MnfFlow
as the initial node splits on MnfFlow
. One of the second level nodes is BrandCode
, another highly important variable to the CART model. Eight ff the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different model approaches.
Overall, the final CART model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.
Now we train a MARS model based on the implementation of earth from the caret
library. The MARS approach is a non-linear regression model that does not rely on the original variables but instead on derived features. The tuned MARS model defines the derived features as linear combinations of the initial variables.
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
## nprune=27)
##
## coefficients
## (Intercept) 8.3958776
## h(-0.215889-MnfFlow) 0.1780017
## h(HydPressure3- -1.29159) 0.0391037
## h(-0.728732-Temperature) -0.0805473
## h(Temperature- -0.728732) -0.0477016
## h(-0.658787-Balling) 0.3588441
## h(-0.832915-OxygenFiller) -0.0445311
## h(OxygenFiller- -0.832915) -0.0417267
## h(BowlSetpoint- -1.34947) 0.0724302
## h(0.640105-AlchRel) 0.0395901
## h(AlchRel-0.640105) 0.1293063
## h(1.05286-MnfFlow) * BrandCodeC -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629) -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum) -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882) -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer) 0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel) -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947) -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling) -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787) 0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787) 0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947) 0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787) -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947) -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel) 0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718) 0.0305074
##
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.
## Saving 7 x 5 in image
Once again, the variable MnfFlow
achieves an importance score of 100%. Only 3 variables reach the score of 50%, the aforementioned MnfFlow
, BrandCodeC
, and AlchRel
.
The variable MnfFlow
is influential because the variable effects many other variables in the final model. Due to the numerous linear combinations including MnfFlow
, the relationship to PH
is difficult to ascertain. The MnfFlow
variable requires more investigation into the variable interactions to determine the impact to PH
for the MARS model.
## Saving 7 x 5 in image
Overall, the MARS model reaffirms the importance of the variable MnfFlow
and also results in an RMSE and \(R^2\) in line with the previous models for Gradient Boosted Trees and CART.
We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51932 -0.07897 0.00917 0.08626 0.44987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.586130 0.010428 823.364 < 2e-16 ***
## CarbVolume -0.005488 0.008389 -0.654 0.513047
## FillOunces -0.004105 0.003163 -1.298 0.194451
## PCVolume -0.006448 0.003678 -1.753 0.079740 .
## CarbPressure -0.003088 0.011767 -0.262 0.793033
## CarbTemp 0.008475 0.010712 0.791 0.428936
## PSC -0.003168 0.003147 -1.007 0.314160
## PSCFill -0.005762 0.003066 -1.879 0.060361 .
## PSCCO2 -0.006170 0.003020 -2.043 0.041221 *
## MnfFlow -0.089502 0.006520 -13.728 < 2e-16 ***
## CarbPressure1 0.030465 0.003680 8.278 2.25e-16 ***
## FillPressure 0.005601 0.004501 1.244 0.213528
## HydPressure1 0.004482 0.005177 0.866 0.386658
## HydPressure2 -0.027126 0.009963 -2.723 0.006533 **
## HydPressure3 0.059090 0.010506 5.624 2.12e-08 ***
## HydPressure4 0.001523 0.004624 0.329 0.741982
## FillerLevel -0.021768 0.008733 -2.493 0.012760 *
## FillerSpeed 0.003863 0.004923 0.785 0.432809
## Temperature -0.019234 0.003564 -5.397 7.59e-08 ***
## Usagecont -0.020299 0.003911 -5.191 2.31e-07 ***
## CarbFlow 0.012499 0.004515 2.768 0.005685 **
## Density -0.018774 0.009023 -2.081 0.037598 *
## MFR 0.002030 0.003898 0.521 0.602651
## Balling -0.091337 0.014856 -6.148 9.43e-10 ***
## PressureVacuum -0.016446 0.004588 -3.585 0.000345 ***
## OxygenFiller -0.015182 0.004217 -3.600 0.000326 ***
## BowlSetpoint 0.053190 0.009135 5.823 6.72e-09 ***
## PressureSetpoint -0.013892 0.004500 -3.087 0.002047 **
## AirPressurer -0.002649 0.003204 -0.827 0.408420
## AlchRel 0.034578 0.011814 2.927 0.003463 **
## CarbRel 0.005015 0.006593 0.761 0.447003
## BallingLvl 0.097856 0.017857 5.480 4.79e-08 ***
## BrandCodeA -0.095705 0.025430 -3.764 0.000172 ***
## BrandCodeC -0.139671 0.010233 -13.649 < 2e-16 ***
## BrandCodeD -0.043954 0.029625 -1.484 0.138054
## BrandCodeE -0.057782 0.015435 -3.744 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared: 0.4223, Adjusted R-squared: 0.4123
## F-statistic: 42.16 on 35 and 2019 DF, p-value: < 2.2e-16
The most significant variables are MnfFlow
, CarbPressure
, HydPressure3
, Temperature
, Usagecont
, Balling
, PressureVacuum
, OxygenFiller
, BowlSetpoint
, BallingLvl
, and BrandCode
. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median value of the variable coefficients from the top 10 of the Cubist model match the sign of the coefficient value in the linear regression model. This finding further supports the assessment of the Cubist model variable impacts on the dependent variable PH
.
We summarize the plot of coefficients below.
## NULL
The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.
Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.
Based on the above results, we summarize their training and test RMSE and \(R^2\) below.
The joint results are tabulated below.
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
= list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)
model_list
= data.frame()
model_stats
for( modelObj in model_list )
{if( modelObj$method != 'lm' ){
<- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
pred_data
}else
{<- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
pred_data
}<- data.frame( modelName = modelObj$method ,
output trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
testR2 = caret::R2( pred_data, sdata_test_pp$PH)
)
<- rbind(model_stats, output)
model_stats }
Model Results Comparison
modelName | trainRMSE | trainR2 | testRMSE | testR2 |
---|---|---|---|---|
cubist | 0.095 | 0.694 | 0.090 | 0.722 |
rpart2 | 0.133 | 0.425 | 0.125 | 0.474 |
lm | 0.134 | 0.396 | 0.127 | 0.449 |
gbm | 0.132 | 0.456 | 0.151 | 0.342 |
earth | 0.125 | 0.486 | 0.153 | 0.279 |
## save_kable will have the best result with magick installed.
We also inspect the plot of the predicted versus the observed data for the 5 models. The scatterplots below confirm the Cubist model outperforms the others on both the training set in the upper row and also the test data on the lower rows. The distribution of the Cubist plot mostly closely aligns with the 45-degree line which corresponds to an accurate prediction.
We conclude that the best model is Cubist but the multilinear regression model, MARS and CART models provide useful insight on the main predictors.
The top 5 variables that impact PH are:
MnfFlow
- this appears negatively related to PH
level based on OLS, MARS and Cubist median plot.
BallingLvl
is positively related to PH
level.
PreserveVacuum
is negatively related to PH.
AlchRel
is positively related to PH
.
Balling
is negatively related to PH
.
Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Mostly likely Brand refers to a recipe for the final product. However, knowing the desired optimal PH
for each Brand may help us control the most influential variables that control PH
, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend looking at the top 10 most influential predictors. Most of the models agree substantially on which predictors to include in the top 10.
For example, the linear regression model coefficients agree with the sign and variable importance of the top 10 predictors of the Cubist model. Importance predictors of the OLS model have the highest statistical significance - consistent with variable importance measure of caret
.
## Cubist
##
## 2567 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.09786899 0.6814835 0.07063050
## 10 1 0.10347326 0.6605109 0.07157433
## 10 5 0.09205621 0.7148932 0.06565672
## 10 9 0.09275054 0.7104184 0.06629558
## 50 0 0.09733927 0.6877932 0.07034005
## 50 1 0.10284267 0.6632483 0.07070141
## 50 5 0.09105650 0.7209833 0.06455347
## 50 9 0.09179098 0.7169196 0.06532125
## 100 0 0.09682465 0.6925441 0.07016957
## 100 1 0.10246532 0.6656340 0.07043498
## 100 5 0.09042487 0.7248834 0.06419195
## 100 9 0.09121584 0.7206745 0.06503557
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.
RMSE of the full StudentData dataset: 0.0904249.
\(R^2\) of the full StudentData dataset: 0.7248834.
We summarize all the R code used in this project in this appendix for ease of reading.
::opts_chunk$set(echo = FALSE)
knitr# Conditional Code Evaluation is managed here.
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rpart.plot))
suppressPackageStartupMessages(library(partykit))
suppressPackageStartupMessages(library(gbm))
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(Cubist)
library(rpart)
library(party)
library(writexl)
-bg {
.code-color: lightgray;
background
}
<- read_excel("StudentData.xlsx")
raw_StudentData dim(raw_StudentData)
skim(raw_StudentData) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentData) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
%>%
raw_StudentData rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)
# include any rows where any column is NA. Always include id column
%>% dplyr::filter_all( any_vars(is.na(.))) -> x1
sdata_v1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
%>% select_if( ~ any(is.na(.) ) ) -> x2
x1
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
%>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
x2
dim(missing_train_rows)
%>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary
missing_train_rows
%>% group_by(num_na) %>%
missing_train_summary summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
= sum( missing_train_summary$num_na)
num_na_cells = nrow(sdata_v1) * 32 # excludes the response variable PH since we dropped its values.
num_cells
print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
<- read_excel("StudentEvaluation.xlsx")
raw_StudentEvaluation dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentEvaluation) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
%>%
raw_StudentEvaluation rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>% as_tibble() -> edata_v1
dim(edata_v1)
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
%>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
edata_v1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
%>% select_if( ~ any(is.na(.) ) ) -> ex2
ex1
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
%>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
ex2
dim(missing_test_rows)
%>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary
missing_test_rows
%>% group_by(num_na) %>%
missing_test_summary summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
= sum( missing_test_summary$num_na)
num_na_cells_test = nrow(edata_v1) * 32 # excludes the response variable PH since we dropped its values.
num_cells_test
print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " ,
sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )
= skim(sdata_v1)
skim_sdata = skim(edata_v1)
skim_edata
%>% select( skim_variable, n_missing, complete_rate) %>% inner_join(skim_edata %>% select(skim_variable, n_missing, complete_rate), by = c("skim_variable") ) %>% filter(skim_variable != 'PH' ) -> skim_complete_rates
skim_sdata
%>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>% ggplot(aes(x=training_complete, y = testing_complete)) +
skim_complete_rates geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
labs(title = "Completion Rate by Predictor between Training and Test Sets")
# Feature plot for the numeric predictor variables against the result variable PH
<- sdata_v1 %>%
cols select(-c('id', 'BrandCode', 'PH')) %>% colnames()
featurePlot(sdata_v1[,cols],
$PH,
sdata_v1plot="scatter",
type = c("p", "smooth"),
span = .5,
layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
<- sdata_v1[,cols] %>%
data_long pivot_longer(cols) %>%
as.data.frame()
<- ggplot(data_long, aes(x = value)) +
ggp2 geom_density() +
facet_wrap(~ name, scales="free") +
labs(title = "Density Plot of Predictor Variables")
ggp2# http://www.sthda.com/english/wiki/normality-test-in-r
# From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
<- lapply(sdata_v1[,cols], shapiro.test)
shap_test_res # Result: None of the numeric variables are normally distributed.
# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
<- sapply(shap_test_res, function(x) x$p.value > .05)
subset_vector <- shap_test_res[subset_vector]
results_subset
length(results_subset)
ggplot(data = sdata_v1) +
geom_bar(mapping = aes(x = `BrandCode`)) +
labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
geom_boxplot() +
labs(title = "Boxplot of Brand Code")
<- sdata_v1 %>% drop_na()
sdata_v1_no_nas
<- VSURF(sdata_v1_no_nas[,cols],
bev.vsurf $PH,
sdata_v1_no_nasntree = 10,
nfor.thres = 20,
nfor.interp = 10, nfor.pred = 10)
bev.vsurfsummary(bev.vsurf)
$varselect.pred
bev.vsurfset.seed(19372)
= createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
train_indices
= sdata_v1[-train_indices , ]
sdata_test = sdata_v1[ train_indices , ]
sdata_train = sdata_v1$PH[-train_indices ]
sdataY_test = sdata_v1$PH[ train_indices ]
sdataY_train
nearZeroVar(sdata_train, saveMetrics = TRUE) %>%
kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(sdata_train) %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
%>%
) filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
arrange( desc(zscore_extreme)) %>%
::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
dplyrkable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
= sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl1 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7
= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plotlistplot_grid(plotlist=plotlist, nrow = 2 )
= median( sdata_train$MFR , na.rm = TRUE )
median_MFR
%>%
sdata_train mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_train
%>%
sdata_test mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_test
%>%
edata_v1 mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> edata_v2
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
= preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
preProcFunc
# Becomes the source data for the model building
= predict( preProcFunc, as.data.frame(sdata_v2_train ) )
sdata_train_pp
# Becomes the final version of test data for validation
= predict( preProcFunc, as.data.frame(sdata_v2_test) )
sdata_test_pp
# Need to generate predictions based on test data without known response
= predict( preProcFunc, as.data.frame(edata_v2) )
edata_test_pp
= sdata_train_pp %>% select( -PH)
sdata_trainX_pp = sdata_test_pp %>% select(-PH)
sdata_testX_pp = edata_test_pp %>% select( -PH)
edata_testX_pp
skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
= cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
M
::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
corrplot= cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
Mpp ::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
corrplot
::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr
::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr
::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )
readr
= tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
file_df primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
num_rows = c(2055, 512, 267 ) ,
file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)
%>% kable(caption = "Pre-Processed Data Files") %>% kable_styling(bootstrap_options = c("hover", "striped"
file_df
), position = "left")
= read_csv("sdata_train_pp.csv",
sdata_train_pp col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
= read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
sdata_test_pp as_tibble() %>% remove_rownames()
= sdata_train_pp %>% select( -PH, -id )
sdata_trainX_pp = sdata_test_pp %>% select(-PH , -id)
sdata_testX_pp = TRUE # Set to TRUE if running the lengthy tuning process for all models
tuningFull set.seed(1027)
<- trainControl(method = "repeatedcv",
gbmControlFull number = 10, # for debug mode, set this to a low number (1)
repeats = 10 ,
selectionFunction = "best",
verboseIter = FALSE)
<- trainControl(method = "repeatedcv",
gbmControlLight number = 5, # for debug mode, set this to a low number (1)
repeats = 1 ,
selectionFunction = "best",
verboseIter = FALSE)
if(tuningFull == TRUE)
{= gbmControlFull
gbmControl = expand.grid( n.trees = seq(250, 1000, by = 250),
gbmTune shrinkage = c( 0.02, 0.05, 0.1) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 15
)else {
} = gbmControlLight
gbmControl = expand.grid( n.trees = c( 2000),
gbmTune shrinkage = c( 0.02, 0.05 , 0.1 ) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 4
)
}
gbmTune = caret::train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "gbm",
tuneGrid = gbmTune ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )
<- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
gbm_plotggsave("GradientBoostedTreesTurning.jpg", gbm_plot)
<- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
gbm_vi_plotggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)
<- trainControl(method = "cv" , selectionFunction = "best")
cubistControlFull <- expand.grid( committees = c( 10, 50, 100 ) ,
tuneGridFull neighbors = c( 0, 1, 5, 9 )
)
<- trainControl(method = "cv" , selectionFunction = "best")
cubistControlLight <- expand.grid( committees = c( 10, 20 ) ,
tuneGridLight neighbors = c( 0, 5 ) )
if(tuningFull == TRUE)
{= cubistControlFull
cubistControl = tuneGridFull
cubistGrid else {
} = cubistControlLight
cubistControl = tuneGridLight
cubistGrid
}
cubistTune = caret::train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
<- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
cubist_plotggsave("CubistTuning.jpg", cubist_plot)
<- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
cubist_vi_plotggsave("CubistVarImp.jpg", cubist_vi_plot)
$finalModel$usage %>% arrange( desc(Model+ Conditions)) %>%
cubistTunekable(caption = "Cubist Predictor Usage") %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
<- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot
cubist_splitsplot# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplotdev.off()
<- dotplot( cubistTune$finalModel,
cubist_dotplot what = "coefs",
between = list(x = 0.2, y = 0.5) ,
scales = list(x = list(relation = "free"),
y = list(cex = 0.25) ) )
cubist_dotplotjpeg(file="CubistDotPlot.jpg")
cubist_splitsplotdev.off()
$finalModel$coefficients %>%
cubistTunerownames_to_column(var="id") %>%
select(-committee, -rule) %>%
pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>%
filter( !is.na(coef)) %>%
filter( predictor != '(Intercept)' ) %>%
filter(abs(coef) < 20 ) -> coef_piv
<- coef_piv %>%
cubist_boxplot ggplot() +
geom_boxplot(
aes(x=reorder(predictor, coef, FUN = median, na.rm = TRUE ), # order the boxplots by median
y = coef ), outlier.shape = NA ) + # strip out the outliers
coord_flip(ylim=c(-.5,.5)) +
labs(title = "Distribution of Coefficients of Predictors",
subtitle = "from Cubist Committee/Rules",
x = "Predictors" ,
y = "Coefficient in Linear Model"
)
cubist_boxplotggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)
library(rpart)
= expand.grid( maxdepth = seq( 1, 20, by = 1 ) )
cartGrid
<- trainControl(method = "boot",
cartControl number = 30,
selectionFunction = "best")
rpartTune = train( x = sdata_trainX_pp,
(y = sdata_train_pp$PH ,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )
<- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
cart_tuning_plotggsave("CartTuningPlot.jpg", cart_tuning_plot)
<- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
cart_vi_plotggsave("CartVarImp.jpg", cart_vi_plot)
rpart.plot(rpartTune$finalModel, digits = 3)
if( tuningFull == TRUE )
{= expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
marsGrid = trainControl(method = "cv",
marsControl number = 10,
selectionFunction = "best",
verboseIter = FALSE)
else {
} = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
marsGrid
= trainControl(method = "cv",
marsControl number = 10,
selectionFunction = "best",
verboseIter = FALSE)
}
set.seed(1000)
= caret::train(x = sdata_trainX_pp ,
marsTuned y = sdata_train_pp$PH,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
trControl = marsControl )
$finalModel
marsTuned
summary(marsTuned)
<- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
mars_tuning_plotggsave("MarsTuningPlot.jpg", mars_tuning_plot)
<- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
mars_vi_plotggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)
<- trainControl( method = "cv", number = 5)
train.control
= train( PH ~ . -id,
linearTune data = sdata_train_pp,
method = "lm",
tuneGrid = data.frame(intercept = TRUE),
trControl = train.control)
summary(linearTune$finalModel)
par(mfrow=c(2,2))
<- plot(linearTune$finalModel)
lm_final_model_plot
lm_final_model_plot#ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
= list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)
model_list
= data.frame()
model_stats
for( modelObj in model_list )
{if( modelObj$method != 'lm' ){
<- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
pred_data
}else
{<- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
pred_data
}<- data.frame( modelName = modelObj$method ,
output trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
testR2 = caret::R2( pred_data, sdata_test_pp$PH)
)
<- rbind(model_stats, output)
model_stats
}
<- model_stats %>% as_tibble() %>%
results_table arrange( testRMSE) %>%
kable( digits = 3, caption = "Model Results Comparison") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
results_table
save_kable(results_table, "ModelResultsTable.jpg")
= extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )
predVals
# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
= predict(linearTune, newdata = sdata_test_pp)
pred_lm <- linearTune$finalModel$fitted.values
pred_lm_train
# Create placeholder data for the training values
<- rep("LM", length = 2055)
model_l <- rep("Training", length = 2055)
dt_l <- rep("Object9", length = 2055)
obj_l
# Create dataframe of the training data of LM
<- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)
lm_train_preds
# Create placeholder data for the test values
<- rep("LM", length = 512)
model_l <- rep("Test", length = 512)
dt_l <- rep("Object10", length = 512)
obj_l
# Create dataframe of the test data of LM
<- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)
lm_test_preds
# Combine all data for plotting
<- rbind(predVals, lm_train_preds, lm_test_preds)
all_preds
# Ensure obs and preds are numerics
$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)
all_preds
# Plot using Caret built-in reporting.
# --------------------------------------
<- plotObsVsPred(all_preds)
obsVsPredsPlot
obsVsPredsPlot#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
<- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfull_pp <- sdata_trainfull_pp %>% select(-PH , -id)
sdata_trainfullX_pp
= read_csv("edata_test_pp.csv",
edata_test_pp col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
# Train the Cubist model
set.seed(1027)
if(tuningFull == TRUE)
{= cubistControlFull
cubistControl = tuneGridFull
cubistGrid else {
} = cubistControlLight
cubistControl = tuneGridLight
cubistGrid
}
cubistTune = caret::train( x = sdata_trainfullX_pp,
(y = sdata_trainfull_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
# Make predictions on the provided dataset
<- as.numeric( predict( cubistTune, newdata=edata_test_pp ))
final_pred_data
# Output prediction performance
<- getTrainPerf(cubistTune)[1,1]
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,2]
full_cubist_train_r2 # Output predictions to Excel file
<- read_excel("StudentEvaluation.xlsx")
raw_StudentEvaluation $PH <- final_pred_data
raw_StudentEvaluationwrite_xlsx(raw_StudentEvaluation, "ANg_PTanofsky_Ph_Predictions.xlsx")