This technical report tunes multiple models for prediction of beverage PH and identifies determinants of PH among predictors in the manufacturing process. We train and test 5 models for this purpose including the Cubist, Gradient Boosted Tree, CART, MARS and linear regression models.
The training and test performance are evaluated separately on the StudentData dataset with the metrics RMSE and \(R^2\) to determine the model algorithm most appropriate for predicting the PH of the StudentEvaluation dataset.
We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:
id to allow unique identification of all raw observations.PH.PH value into a 80/20 train and validate data set. We recognize that a test data set exists but this is for external assessment of the project prediction accuracy.BrandCode categorical value with a dummy BrandCode EMFR missing value with the training set median.Along the way, we will assess the numerical impact of such changes to the dataset.
## [1] 2571 33
Let’s evaluate the data characteristics of the raw file after loading. We use the skimr library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MFR | 212 | 0.918 | 704.049 | 73.898 | 31.400 | 706.300 | 724.000 | 731.000 | 868.600 | ▁▁▁▂▇ |
| Filler Speed | 57 | 0.978 | 3687.199 | 770.820 | 998.000 | 3888.000 | 3982.000 | 3998.000 | 4030.000 | ▁▁▁▁▇ |
| PC Volume | 39 | 0.985 | 0.277 | 0.061 | 0.079 | 0.239 | 0.271 | 0.312 | 0.478 | ▁▃▇▂▁ |
| PSC CO2 | 39 | 0.985 | 0.056 | 0.043 | 0.000 | 0.020 | 0.040 | 0.080 | 0.240 | ▇▅▂▁▁ |
| Fill Ounces | 38 | 0.985 | 23.975 | 0.088 | 23.633 | 23.920 | 23.973 | 24.027 | 24.320 | ▁▂▇▂▁ |
| PSC | 33 | 0.987 | 0.085 | 0.049 | 0.002 | 0.048 | 0.076 | 0.112 | 0.270 | ▆▇▃▁▁ |
| Carb Pressure1 | 32 | 0.988 | 122.586 | 4.743 | 105.600 | 119.000 | 123.200 | 125.400 | 140.200 | ▁▃▇▂▁ |
| Hyd Pressure4 | 30 | 0.988 | 96.289 | 13.123 | 52.000 | 86.000 | 96.000 | 102.000 | 142.000 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.989 | 68.190 | 3.538 | 57.000 | 65.600 | 68.200 | 70.600 | 79.400 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.990 | 141.095 | 4.037 | 128.600 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▅▇▃▁ |
| PSC Fill | 23 | 0.991 | 0.195 | 0.118 | 0.000 | 0.100 | 0.180 | 0.260 | 0.620 | ▆▇▃▁▁ |
| Fill Pressure | 22 | 0.991 | 47.922 | 3.178 | 34.600 | 46.000 | 46.400 | 50.000 | 60.400 | ▁▁▇▂▁ |
| Filler Level | 20 | 0.992 | 109.252 | 15.698 | 55.800 | 98.300 | 118.400 | 120.000 | 161.200 | ▁▃▅▇▁ |
| Hyd Pressure2 | 15 | 0.994 | 20.961 | 16.386 | 0.000 | 0.000 | 28.600 | 34.600 | 59.400 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.994 | 20.458 | 15.976 | -1.200 | 0.000 | 27.600 | 33.400 | 50.000 | ▇▁▃▇▁ |
| Temperature | 14 | 0.995 | 65.968 | 1.383 | 63.600 | 65.200 | 65.600 | 66.400 | 76.200 | ▇▃▁▁▁ |
| Oxygen Filler | 12 | 0.995 | 0.047 | 0.047 | 0.002 | 0.022 | 0.033 | 0.060 | 0.400 | ▇▁▁▁▁ |
| Pressure Setpoint | 12 | 0.995 | 47.615 | 2.039 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
| Hyd Pressure1 | 11 | 0.996 | 12.438 | 12.433 | -0.800 | 0.000 | 11.400 | 20.200 | 58.000 | ▇▅▂▁▁ |
| Carb Volume | 10 | 0.996 | 5.370 | 0.106 | 5.040 | 5.293 | 5.347 | 5.453 | 5.700 | ▁▆▇▅▁ |
| Carb Rel | 10 | 0.996 | 5.437 | 0.129 | 4.960 | 5.340 | 5.400 | 5.540 | 6.060 | ▁▇▇▂▁ |
| Alch Rel | 9 | 0.996 | 6.897 | 0.505 | 5.280 | 6.540 | 6.560 | 7.240 | 8.620 | ▁▇▂▃▁ |
| Usage cont | 5 | 0.998 | 20.993 | 2.978 | 12.080 | 18.360 | 21.790 | 23.755 | 25.900 | ▁▃▅▃▇ |
| PH | 4 | 0.998 | 8.546 | 0.173 | 7.880 | 8.440 | 8.540 | 8.680 | 9.360 | ▁▅▇▂▁ |
| Mnf Flow | 2 | 0.999 | 24.569 | 119.481 | -100.200 | -100.000 | 65.200 | 140.800 | 229.400 | ▇▁▁▇▂ |
| Carb Flow | 2 | 0.999 | 2468.354 | 1073.696 | 26.000 | 1144.000 | 3028.000 | 3186.000 | 5104.000 | ▂▅▆▇▁ |
| Bowl Setpoint | 2 | 0.999 | 109.327 | 15.303 | 70.000 | 100.000 | 120.000 | 120.000 | 140.000 | ▁▂▃▇▁ |
| Density | 1 | 1.000 | 1.174 | 0.378 | 0.240 | 0.900 | 0.980 | 1.620 | 1.920 | ▁▅▇▂▆ |
| Balling | 1 | 1.000 | 2.198 | 0.931 | -0.170 | 1.496 | 1.648 | 3.292 | 4.012 | ▁▇▇▁▇ |
| Balling Lvl | 1 | 1.000 | 2.050 | 0.870 | 0.000 | 1.380 | 1.480 | 3.140 | 3.660 | ▁▇▂▁▆ |
| Pressure Vacuum | 0 | 1.000 | -5.216 | 0.570 | -6.600 | -5.600 | -5.400 | -5.000 | -3.600 | ▂▇▆▂▁ |
| Air Pressurer | 0 | 1.000 | 142.834 | 1.212 | 140.800 | 142.200 | 142.600 | 143.000 | 148.200 | ▅▇▁▁▁ |
Next, we display the one character variable Brand below and its skimr output below.
Character Variables in Beverage Data
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.953 | 1 | 1 | 0 | 4 | 0 |
After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH.
General observations about the skimr results:
MFR is missing the most data: about 8% (212) of its observations are missing. The percentage of missing observations is too high for automated imputation in our opinion.Filler Speed is missing the next most data: about 2.2% (57) of its observations are missing. The percentage is low enough for us to consider imputation.MFR, Brand, all of the other predictors have data completeness rates over 97.8%. Therefore, except for Brand and MFR, an automated data imputation policy can be considered for those columns.PH has 4 missing observations. We will decide to drop those 4 observations as we do not wish to impute the response.Our first transformation below is to add a unique identifier id based on row number to obtain a primary key.
Then we drop the 4 observations which are missing PH values, rearrange the response variable PH and BrandCode for our convenience. Lastly, we also eliminate spaces from the predictor column names to avoid using quote in our code.
raw_StudentData %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)## [1] 2567 34
Is there any observations with a high number of missing observations?
# include any rows where any column is NA. Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
dim(missing_train_rows)## [1] 529 28
We find there are 529 observations with NA observations. So 20.6% have NA values. We conclude that the NA values are sufficiently well distributed that removal of all observations with NA values is impractical.
In the marginal table below, we investigate if certain observations have a high incidence of NA values amongst their predictors. The answer shows that the incident of multiple
Number of Rows with NA
| num_na | count |
|---|---|
| 1 | 345 |
| 2 | 119 |
| 3 | 45 |
| 4 | 13 |
| 5 | 3 |
| 6 | 2 |
| 7 | 1 |
| 8 | 1 |
## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"
Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio suggesting independence among the occurrence of missing values.
## [1] 267 33
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MFR | 31 | 0.884 | 697.801 | 96.400 | 15.600 | 707.000 | 724.600 | 731.450 | 784.800 | ▁▁▁▁▇ |
| Filler Speed | 10 | 0.963 | 3581.385 | 911.187 | 1006.000 | 3812.000 | 3978.000 | 3996.000 | 4020.000 | ▁▁▁▁▇ |
| Fill Ounces | 6 | 0.978 | 23.970 | 0.076 | 23.747 | 23.920 | 23.967 | 24.013 | 24.200 | ▁▅▇▃▁ |
| PSC | 5 | 0.981 | 0.085 | 0.053 | 0.004 | 0.044 | 0.076 | 0.112 | 0.246 | ▆▇▃▂▁ |
| PSC CO2 | 5 | 0.981 | 0.051 | 0.038 | 0.000 | 0.020 | 0.040 | 0.060 | 0.240 | ▇▃▂▁▁ |
| PC Volume | 4 | 0.985 | 0.278 | 0.063 | 0.099 | 0.233 | 0.275 | 0.322 | 0.464 | ▁▆▇▅▁ |
| Carb Pressure1 | 4 | 0.985 | 123.040 | 4.420 | 113.000 | 120.200 | 123.400 | 125.500 | 136.000 | ▃▃▇▂▁ |
| Hyd Pressure4 | 4 | 0.985 | 97.840 | 13.921 | 68.000 | 90.000 | 98.000 | 104.000 | 140.000 | ▅▆▇▂▁ |
| PSC Fill | 3 | 0.989 | 0.190 | 0.113 | 0.020 | 0.100 | 0.180 | 0.260 | 0.620 | ▇▇▃▁▁ |
| Oxygen Filler | 3 | 0.989 | 0.047 | 0.050 | 0.002 | 0.020 | 0.034 | 0.054 | 0.398 | ▇▁▁▁▁ |
| Alch Rel | 3 | 0.989 | 6.907 | 0.498 | 6.400 | 6.540 | 6.580 | 7.180 | 7.820 | ▇▁▂▁▃ |
| Fill Pressure | 2 | 0.993 | 48.141 | 3.440 | 37.800 | 46.000 | 47.800 | 50.200 | 60.200 | ▁▇▇▂▁ |
| Filler Level | 2 | 0.993 | 110.292 | 15.502 | 69.200 | 100.600 | 118.600 | 120.200 | 153.200 | ▂▃▇▇▁ |
| Temperature | 2 | 0.993 | 66.227 | 1.692 | 63.800 | 65.400 | 65.800 | 66.600 | 75.400 | ▇▅▁▁▁ |
| Usage cont | 2 | 0.993 | 20.897 | 2.999 | 12.900 | 18.120 | 21.440 | 23.740 | 24.600 | ▁▃▃▃▇ |
| Pressure Setpoint | 2 | 0.993 | 47.733 | 2.057 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
| Carb Rel | 2 | 0.993 | 5.440 | 0.127 | 5.180 | 5.340 | 5.400 | 5.560 | 5.740 | ▂▇▂▃▂ |
| Carb Volume | 1 | 0.996 | 5.369 | 0.111 | 5.147 | 5.287 | 5.340 | 5.465 | 5.667 | ▂▇▃▅▁ |
| Carb Temp | 1 | 0.996 | 141.226 | 4.296 | 130.000 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▆▇▃▁ |
| Hyd Pressure2 | 1 | 0.996 | 20.113 | 17.214 | -50.000 | 0.000 | 26.800 | 34.800 | 61.400 | ▁▁▆▇▁ |
| Hyd Pressure3 | 1 | 0.996 | 19.609 | 16.557 | -50.000 | 0.000 | 27.700 | 33.000 | 49.200 | ▁▁▆▃▇ |
| Density | 1 | 0.996 | 1.177 | 0.382 | 0.060 | 0.920 | 0.980 | 1.600 | 1.840 | ▁▁▇▁▅ |
| Balling | 1 | 0.996 | 2.203 | 0.925 | 0.902 | 1.498 | 1.648 | 3.242 | 3.788 | ▅▇▁▂▅ |
| Pressure Vacuum | 1 | 0.996 | -5.174 | 0.582 | -6.400 | -5.600 | -5.200 | -4.800 | -3.600 | ▁▇▆▃▁ |
| Bowl Setpoint | 1 | 0.996 | 109.624 | 15.017 | 70.000 | 100.000 | 120.000 | 120.000 | 130.000 | ▁▂▁▃▇ |
| Air Pressurer | 1 | 0.996 | 142.831 | 1.231 | 141.200 | 142.200 | 142.600 | 142.800 | 147.200 | ▅▇▁▁▁ |
| Carb Pressure | 0 | 1.000 | 68.255 | 3.857 | 60.200 | 65.300 | 68.000 | 70.600 | 77.600 | ▃▆▇▃▂ |
| Mnf Flow | 0 | 1.000 | 21.034 | 117.756 | -100.200 | -100.000 | 0.200 | 141.300 | 220.400 | ▇▁▁▆▂ |
| Hyd Pressure1 | 0 | 1.000 | 12.010 | 13.531 | -50.000 | 0.000 | 10.400 | 20.400 | 50.000 | ▁▁▇▆▂ |
| Carb Flow | 0 | 1.000 | 2408.644 | 1161.364 | 0.000 | 1083.000 | 3038.000 | 3215.000 | 3858.000 | ▂▃▁▆▇ |
| Balling Lvl | 0 | 1.000 | 2.051 | 0.883 | 0.000 | 1.380 | 1.480 | 3.080 | 3.420 | ▁▃▇▁▇ |
Character Variables in Beverage Data
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
## [1] 267 34
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
dim(missing_test_rows)## [1] 67 28
Number of Rows with NA
| num_na | count |
|---|---|
| 1 | 50 |
| 2 | 11 |
| 3 | 3 |
| 4 | 1 |
| 8 | 1 |
| 14 | 1 |
## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"
We conclude that the distribution of missing values in the test set is:
The scatterplots of the predictor variables against the dependent variable PH indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.
The density plots of each predictor variable show many predictor variables with non-normal distributions.
The Shapiro-Wilk test of normality is performed on all the numeric predictor variables and the resulting list confirms no p-value above 0.05 and thus no predictor variable follows a normal distribution per evaluation.
## [1] 0
The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.
The boxplot of PH values by Brand Code certainly captures overlap of the middle two quartiles, which seems reasonable given the PH range is rather narrow. Of note, the median values of the 4 labeled brands and the unlabeled group do appear distinct besides Brand A and the unlabeled.
To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.
## Thresholding step
## Estimated computational time (on one core): 12.3 sec.
##
|
| | 0%
|
|==== | 5%
|
|======= | 10%
|
|========== | 15%
|
|============== | 20%
|
|================== | 25%
|
|===================== | 30%
|
|======================== | 35%
|
|============================ | 40%
|
|================================ | 45%
|
|=================================== | 50%
|
|====================================== | 55%
|
|========================================== | 60%
|
|============================================== | 65%
|
|================================================= | 70%
|
|==================================================== | 75%
|
|======================================================== | 80%
|
|============================================================ | 85%
|
|=============================================================== | 90%
|
|================================================================== | 95%
|
|======================================================================| 100%
## Interpretation step (on 26 variables)
## Estimated computational time (on one core): between 799.5 sec. and 2356.1 sec.
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|=========== | 15%
|
|============= | 19%
|
|================ | 23%
|
|=================== | 27%
|
|====================== | 31%
|
|======================== | 35%
|
|=========================== | 38%
|
|============================== | 42%
|
|================================ | 46%
|
|=================================== | 50%
|
|====================================== | 54%
|
|======================================== | 58%
|
|=========================================== | 62%
|
|============================================== | 65%
|
|================================================ | 69%
|
|=================================================== | 73%
|
|====================================================== | 77%
|
|========================================================= | 81%
|
|=========================================================== | 85%
|
|============================================================== | 88%
|
|================================================================= | 92%
|
|=================================================================== | 96%
|
|======================================================================| 100%
## Prediction step (on 17 variables)
## Maximum estimated computational time (on one core): 970.2 sec.
##
|
| | 0%
|
|==== | 6%
|
|======== | 12%
|
|============ | 18%
|
|================ | 24%
|
|===================== | 29%
|
|========================= | 35%
|
|============================= | 41%
|
|================================= | 47%
|
|===================================== | 53%
|
|========================================= | 59%
|
|============================================= | 65%
|
|================================================= | 71%
|
|====================================================== | 76%
|
|========================================================== | 82%
|
|============================================================== | 88%
|
|================================================================== | 94%
|
|======================================================================| 100%
## ** VSURF results **
## The results object is a list of length around 20.
## Most interesting components are the following:
##
## name description
## 1 "$varselect.thres" "variables selected after thesholding step"
## 2 "$varselect.interp" "variables selected after interpretation step"
## 3 "$varselect.pred" "variables selected after prediction step"
## 4 "$ord.imp$x" "mean VI in decreasing order for all variables"
## 5 "$ord.imp$ix" "indices of the ordering of all variables VI mean"
## 6 "mean.perf" "mean OOB rate for RF build on all variables"
##
## For more information about VSURF outputs see the VSURF help page
##
## VSURF computation time: 36.2 mins
##
## VSURF selected:
## 26 variables at thresholding step (in 7.4 secs)
## 17 variables at interpretation step (in 23.4 mins)
## 16 variables at prediction step (in 12.6 mins)
## [1] 9 19 30 16 29 26 31 20 18 23 25 24 27 17 12 28
Next we partition the student evaluation data set by training and test sets. We will retain the terminology of test set to refer to the validation data set used for model performance assessment outside of the cross validation process.
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
sdata_test = sdata_v1[-train_indices , ]
sdata_train = sdata_v1[ train_indices , ]
sdataY_test = sdata_v1$PH[-train_indices ]
sdataY_train = sdata_v1$PH[ train_indices ]We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.
No Near Zero Variance Predictors in Training
| freqRatio | percentUnique | zeroVar | nzv | |
|---|---|---|---|---|
| id | 1.00 | 100.00 | FALSE | FALSE |
| PH | 1.05 | 2.48 | FALSE | FALSE |
| CarbVolume | 1.03 | 4.72 | FALSE | FALSE |
| FillOunces | 1.18 | 4.38 | FALSE | FALSE |
| PCVolume | 1.19 | 21.22 | FALSE | FALSE |
| CarbPressure | 1.10 | 5.06 | FALSE | FALSE |
| CarbTemp | 1.04 | 5.99 | FALSE | FALSE |
| PSC | 1.14 | 6.23 | FALSE | FALSE |
| PSCFill | 1.11 | 1.56 | FALSE | FALSE |
| PSCCO2 | 1.08 | 0.63 | FALSE | FALSE |
| MnfFlow | 1.08 | 21.75 | FALSE | FALSE |
| CarbPressure1 | 1.02 | 6.37 | FALSE | FALSE |
| FillPressure | 1.78 | 5.06 | FALSE | FALSE |
| HydPressure1 | 31.86 | 11.58 | FALSE | FALSE |
| HydPressure2 | 7.63 | 9.39 | FALSE | FALSE |
| HydPressure3 | 12.71 | 8.91 | FALSE | FALSE |
| HydPressure4 | 1.02 | 1.90 | FALSE | FALSE |
| FillerLevel | 1.05 | 12.90 | FALSE | FALSE |
| FillerSpeed | 1.00 | 10.41 | FALSE | FALSE |
| Temperature | 1.05 | 2.63 | FALSE | FALSE |
| Usagecont | 1.12 | 22.43 | FALSE | FALSE |
| CarbFlow | 1.33 | 23.21 | FALSE | FALSE |
| Density | 1.19 | 3.65 | FALSE | FALSE |
| MFR | 1.04 | 25.94 | FALSE | FALSE |
| Balling | 1.24 | 9.34 | FALSE | FALSE |
| PressureVacuum | 1.43 | 0.78 | FALSE | FALSE |
| OxygenFiller | 1.32 | 15.67 | FALSE | FALSE |
| BowlSetpoint | 2.80 | 0.49 | FALSE | FALSE |
| PressureSetpoint | 1.33 | 0.39 | FALSE | FALSE |
| AirPressurer | 1.03 | 1.41 | FALSE | FALSE |
| AlchRel | 1.10 | 2.53 | FALSE | FALSE |
| CarbRel | 1.02 | 1.95 | FALSE | FALSE |
| BallingLvl | 1.23 | 3.89 | FALSE | FALSE |
| BrandCode | 2.01 | 0.19 | FALSE | FALSE |
Next we identify and fix zeros and outliers in the predictors before imputing missing values.
The reason is that KNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tells us how many standard deviations is the minimum and maximum.
\[ \text{zscore_min}(X_i) = \frac{ min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore_max}(X_i) = \frac{ max(X_i) - \mu(X_i) }{\sigma(X_i )}\]
Predictors with Outliers by Z-Score
| skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.hist | zscore_min | zscore_max |
|---|---|---|---|---|---|---|---|
| MFR | 156 | 0.92 | 704.15 | 74.72 | ▁▁▁▂▇ | -9.00 | 2.20 |
| OxygenFiller | 7 | 1.00 | 0.05 | 0.04 | ▇▁▁▁▁ | -0.97 | 7.88 |
| Temperature | 8 | 1.00 | 65.95 | 1.36 | ▇▃▁▁▁ | -1.73 | 7.55 |
| CarbRel | 6 | 1.00 | 5.44 | 0.13 | ▁▇▇▂▁ | -3.70 | 4.84 |
| AirPressurer | 0 | 1.00 | 142.83 | 1.20 | ▇▇▁▁▁ | -1.52 | 4.48 |
| PSCCO2 | 29 | 0.99 | 0.06 | 0.04 | ▇▅▂▁▁ | -1.31 | 4.31 |
| FillPressure | 15 | 0.99 | 47.91 | 3.14 | ▁▁▇▅▁ | -4.23 | 3.72 |
Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors exist in the distribution of the predictors with outlier values.
Isolated points exist but lie within a reasonable proximity to neighboring points in the scatter plot. Therefore no data corrections for outliers will be applied to the training data.
The code below will transform the special case situations of data pre-processing identified earlier:
E.These rules are applied consistently to be the training, validation and test data sets prior to subsequent pre processing.
We apply the caret pre processing function to that training, validation and test data sets.
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc, as.data.frame(sdata_v2_train ) )
# Becomes the final version of test data for validation
sdata_test_pp = predict( preProcFunc, as.data.frame(sdata_v2_test) )
# Need to generate predictions based on test data without known response
edata_test_pp = predict( preProcFunc, as.data.frame(edata_v2) )
sdata_trainX_pp = sdata_train_pp %>% select( -PH)
sdata_testX_pp = sdata_test_pp %>% select(-PH)
edata_testX_pp = edata_test_pp %>% select( -PH)| Name | sdata_train_pp |
| Number of rows | 2055 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 33 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1297.57 | 751.06 | 1.00 | 644.50 | 1296.00 | 1961.00 | 2568.00 | ▇▇▇▇▇ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▁▁ |
| CarbVolume | 0 | 1 | 0.00 | 1.00 | -3.42 | -0.71 | -0.26 | 0.86 | 2.90 | ▁▃▇▅▁ |
| FillOunces | 0 | 1 | 0.00 | 0.99 | -3.88 | -0.62 | -0.01 | 0.60 | 3.91 | ▁▂▇▂▁ |
| PCVolume | 0 | 1 | 0.01 | 1.00 | -3.72 | -0.60 | -0.07 | 0.59 | 3.01 | ▁▂▇▃▁ |
| CarbPressure | 0 | 1 | 0.01 | 1.00 | -3.11 | -0.72 | 0.04 | 0.70 | 2.88 | ▁▃▇▅▁ |
| CarbTemp | 0 | 1 | 0.01 | 1.00 | -3.47 | -0.65 | 0.01 | 0.69 | 2.90 | ▁▂▇▅▁ |
| PSC | 0 | 1 | 0.00 | 0.99 | -2.75 | -0.65 | 0.01 | 0.66 | 2.83 | ▁▅▇▅▁ |
| PSCFill | 0 | 1 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.54 | 3.60 | ▆▇▃▁▁ |
| PSCCO2 | 0 | 1 | 0.00 | 0.99 | -1.31 | -0.84 | -0.37 | 0.33 | 4.31 | ▇▅▂▁▁ |
| MnfFlow | 0 | 1 | 0.00 | 1.00 | -1.05 | -1.05 | 0.50 | 0.97 | 1.58 | ▇▁▁▆▂ |
| CarbPressure1 | 0 | 1 | -0.01 | 1.00 | -3.63 | -0.77 | 0.13 | 0.60 | 3.77 | ▁▃▇▂▁ |
| FillPressure | 0 | 1 | 0.00 | 1.00 | -5.38 | -0.58 | -0.45 | 0.70 | 3.23 | ▁▁▇▆▁ |
| HydPressure1 | 0 | 1 | 0.00 | 1.00 | -1.07 | -1.01 | -0.08 | 0.65 | 3.66 | ▇▅▂▁▁ |
| HydPressure2 | 0 | 1 | 0.00 | 1.00 | -1.29 | -1.29 | 0.47 | 0.83 | 2.34 | ▇▂▇▅▁ |
| HydPressure3 | 0 | 1 | 0.00 | 1.00 | -1.37 | -1.29 | 0.44 | 0.80 | 1.84 | ▇▁▃▇▁ |
| HydPressure4 | 0 | 1 | 0.01 | 1.00 | -3.40 | -0.76 | 0.07 | 0.52 | 2.82 | ▁▅▇▇▁ |
| FillerLevel | 0 | 1 | 0.00 | 1.00 | -2.85 | -0.72 | 0.54 | 0.68 | 4.30 | ▁▅▇▁▁ |
| FillerSpeed | 0 | 1 | -0.05 | 1.05 | -3.44 | 0.20 | 0.41 | 0.44 | 0.51 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.00 | 1.00 | -1.91 | -0.57 | -0.25 | 0.38 | 6.50 | ▇▆▁▁▁ |
| Usagecont | 0 | 1 | 0.00 | 1.00 | -2.46 | -0.92 | 0.20 | 0.95 | 1.83 | ▁▇▅▇▆ |
| CarbFlow | 0 | 1 | 0.00 | 1.00 | -2.02 | -1.33 | 0.52 | 0.69 | 2.99 | ▃▁▇▁▁ |
| Density | 0 | 1 | 0.00 | 1.00 | -4.85 | -0.68 | -0.41 | 1.18 | 1.71 | ▁▁▁▇▅ |
| MFR | 0 | 1 | 0.00 | 1.00 | -6.70 | -0.02 | 0.29 | 0.41 | 3.35 | ▁▁▁▇▁ |
| Balling | 0 | 1 | 0.00 | 1.00 | -6.22 | -0.74 | -0.50 | 1.19 | 1.68 | ▁▁▁▇▅ |
| PressureVacuum | 0 | 1 | 0.00 | 1.00 | -2.42 | -0.67 | -0.31 | 0.39 | 2.84 | ▂▇▅▃▂ |
| OxygenFiller | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.35 | 0.03 | 0.62 | 3.21 | ▃▇▆▂▁ |
| BowlSetpoint | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.73 | 0.70 | 0.70 | 2.40 | ▁▃▃▇▁ |
| PressureSetpoint | 0 | 1 | 0.00 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
| AirPressurer | 0 | 1 | 0.00 | 1.00 | -1.58 | -0.53 | -0.18 | 0.16 | 4.37 | ▅▇▁▁▁ |
| AlchRel | 0 | 1 | 0.00 | 1.00 | -5.09 | -0.72 | -0.62 | 0.79 | 2.76 | ▁▁▇▂▃ |
| CarbRel | 0 | 1 | 0.00 | 1.00 | -4.28 | -0.75 | -0.26 | 0.97 | 4.21 | ▁▂▇▅▁ |
| BallingLvl | 0 | 1 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.26 | 1.86 | ▁▇▂▁▆ |
| Name | sdata_test_pp |
| Number of rows | 512 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 33 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1237.47 | 705.94 | 5.00 | 638.00 | 1248.50 | 1803.75 | 2571.00 | ▇▇▇▇▆ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.98 | 8.44 | 8.54 | 8.68 | 8.94 | ▁▃▇▇▃ |
| CarbVolume | 0 | 1 | 0.01 | 1.01 | -3.12 | -0.71 | -0.19 | 0.80 | 2.52 | ▁▃▇▅▂ |
| FillOunces | 0 | 1 | 0.03 | 1.01 | -3.20 | -0.55 | -0.01 | 0.60 | 3.99 | ▁▅▇▂▁ |
| PCVolume | 0 | 1 | -0.01 | 0.93 | -3.47 | -0.55 | -0.04 | 0.58 | 2.79 | ▁▂▇▅▁ |
| CarbPressure | 0 | 1 | 0.00 | 0.99 | -3.54 | -0.72 | -0.02 | 0.71 | 2.74 | ▁▂▇▇▁ |
| CarbTemp | 0 | 1 | -0.02 | 1.00 | -3.47 | -0.65 | -0.09 | 0.64 | 2.90 | ▁▂▇▅▁ |
| PSC | 0 | 1 | -0.05 | 1.06 | -2.53 | -0.81 | -0.08 | 0.67 | 2.76 | ▂▇▇▅▂ |
| PSCFill | 0 | 1 | -0.02 | 0.99 | -1.66 | -0.81 | -0.13 | 0.54 | 3.43 | ▆▇▃▂▁ |
| PSCCO2 | 0 | 1 | 0.07 | 1.03 | -1.31 | -0.84 | -0.37 | 0.56 | 4.31 | ▇▅▂▁▁ |
| MnfFlow | 0 | 1 | -0.06 | 0.98 | -1.05 | -1.05 | -0.22 | 0.92 | 1.69 | ▇▁▁▇▁ |
| CarbPressure1 | 0 | 1 | -0.02 | 1.06 | -3.46 | -0.90 | 0.09 | 0.65 | 3.34 | ▁▅▇▅▁ |
| FillPressure | 0 | 1 | 0.02 | 1.04 | -3.82 | -0.58 | -0.45 | 0.70 | 3.41 | ▁▁▇▅▁ |
| HydPressure1 | 0 | 1 | -0.03 | 1.01 | -1.07 | -1.01 | -0.12 | 0.57 | 3.21 | ▇▅▂▁▁ |
| HydPressure2 | 0 | 1 | -0.04 | 1.00 | -1.29 | -1.29 | 0.42 | 0.81 | 2.08 | ▇▁▆▆▁ |
| HydPressure3 | 0 | 1 | -0.05 | 1.00 | -1.37 | -1.29 | 0.40 | 0.79 | 1.81 | ▇▁▃▇▁ |
| HydPressure4 | 0 | 1 | 0.07 | 0.99 | -2.16 | -0.58 | 0.23 | 0.52 | 2.82 | ▂▃▇▂▁ |
| FillerLevel | 0 | 1 | -0.07 | 1.05 | -2.48 | -1.22 | 0.57 | 0.69 | 3.08 | ▂▃▇▁▁ |
| FillerSpeed | 0 | 1 | -0.16 | 1.18 | -3.43 | 0.05 | 0.35 | 0.44 | 0.48 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.06 | 1.05 | -1.91 | -0.57 | -0.10 | 0.38 | 6.30 | ▆▇▁▁▁ |
| Usagecont | 0 | 1 | -0.05 | 1.01 | -2.54 | -1.04 | 0.19 | 0.94 | 1.78 | ▁▇▅▇▇ |
| CarbFlow | 0 | 1 | 0.08 | 1.00 | -2.02 | -1.25 | 0.54 | 0.74 | 1.43 | ▃▁▁▇▅ |
| Density | 0 | 1 | 0.02 | 1.00 | -4.14 | -0.68 | -0.41 | 1.18 | 1.68 | ▁▁▆▇▇ |
| MFR | 0 | 1 | -0.01 | 0.97 | -6.40 | -0.04 | 0.29 | 0.39 | 2.91 | ▁▁▁▇▁ |
| Balling | 0 | 1 | -0.01 | 1.01 | -3.70 | -0.82 | -0.50 | 1.19 | 1.63 | ▁▁▇▂▆ |
| PressureVacuum | 0 | 1 | 0.04 | 1.00 | -2.42 | -0.67 | 0.04 | 0.39 | 2.84 | ▂▇▆▃▂ |
| OxygenFiller | 0 | 1 | 0.09 | 0.98 | -1.98 | -0.27 | 0.08 | 0.69 | 3.21 | ▃▇▇▂▁ |
| BowlSetpoint | 0 | 1 | -0.09 | 1.08 | -2.39 | -1.35 | 0.70 | 0.70 | 2.40 | ▂▃▂▇▁ |
| PressureSetpoint | 0 | 1 | 0.02 | 0.98 | -1.94 | -0.77 | -0.77 | 1.16 | 1.28 | ▁▇▁▁▆ |
| AirPressurer | 0 | 1 | 0.04 | 1.05 | -1.76 | -0.53 | -0.18 | 0.16 | 3.91 | ▂▇▁▁▁ |
| AlchRel | 0 | 1 | 0.00 | 0.99 | -1.08 | -0.72 | -0.67 | 0.68 | 2.74 | ▇▁▂▃▁ |
| CarbRel | 0 | 1 | -0.01 | 1.00 | -3.67 | -0.75 | -0.26 | 0.83 | 3.02 | ▁▂▇▅▁ |
| BallingLvl | 0 | 1 | 0.01 | 1.01 | -2.36 | -0.77 | -0.68 | 1.26 | 1.65 | ▁▆▇▁▇ |
| Name | edata_test_pp |
| Number of rows | 267 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| PH | 267 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 134.00 | 77.22 | 1.00 | 67.50 | 134.00 | 200.50 | 267.00 | ▇▇▇▇▇ |
| CarbVolume | 0 | 1 | -0.02 | 1.04 | -2.23 | -0.81 | -0.26 | 0.89 | 2.63 | ▂▇▅▅▁ |
| FillOunces | 0 | 1 | -0.04 | 0.86 | -2.60 | -0.62 | -0.09 | 0.45 | 2.60 | ▁▅▇▃▁ |
| PCVolume | 0 | 1 | 0.01 | 1.04 | -3.28 | -0.70 | 0.01 | 0.75 | 2.82 | ▁▃▇▇▁ |
| CarbPressure | 0 | 1 | 0.01 | 1.08 | -2.42 | -0.80 | -0.02 | 0.70 | 2.46 | ▂▆▇▅▂ |
| CarbTemp | 0 | 1 | 0.03 | 1.06 | -3.04 | -0.65 | -0.04 | 0.71 | 2.90 | ▁▃▇▃▁ |
| PSC | 0 | 1 | -0.01 | 1.08 | -2.53 | -0.75 | 0.01 | 0.66 | 2.55 | ▂▅▇▃▂ |
| PSCFill | 0 | 1 | -0.05 | 0.96 | -1.49 | -0.73 | -0.13 | 0.46 | 3.60 | ▆▇▃▁▁ |
| PSCCO2 | 0 | 1 | -0.12 | 0.89 | -1.31 | -0.84 | -0.37 | 0.09 | 4.31 | ▇▃▂▁▁ |
| MnfFlow | 0 | 1 | -0.04 | 0.98 | -1.05 | -1.05 | -0.22 | 0.96 | 1.62 | ▇▁▁▆▂ |
| CarbPressure1 | 0 | 1 | 0.09 | 0.94 | -2.05 | -0.53 | 0.17 | 0.60 | 2.87 | ▂▃▇▂▁ |
| FillPressure | 0 | 1 | 0.06 | 1.10 | -3.82 | -0.58 | 0.01 | 0.76 | 3.37 | ▁▁▇▆▁ |
| HydPressure1 | 0 | 1 | -0.04 | 1.09 | -5.04 | -1.01 | -0.17 | 0.63 | 3.02 | ▁▁▇▆▂ |
| HydPressure2 | 0 | 1 | -0.07 | 1.05 | -4.34 | -1.29 | 0.35 | 0.83 | 2.46 | ▁▁▆▇▁ |
| HydPressure3 | 0 | 1 | -0.07 | 1.04 | -4.42 | -1.29 | 0.44 | 0.77 | 1.79 | ▁▁▆▃▇ |
| HydPressure4 | 0 | 1 | 0.13 | 1.05 | -2.63 | -0.41 | 0.23 | 0.66 | 2.73 | ▁▂▇▃▂ |
| FillerLevel | 0 | 1 | 0.06 | 1.01 | -2.32 | -0.66 | 0.56 | 0.69 | 3.51 | ▃▃▇▁▁ |
| FillerSpeed | 0 | 1 | -0.23 | 1.25 | -3.43 | 0.04 | 0.28 | 0.44 | 0.48 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.21 | 1.20 | -1.74 | -0.40 | -0.09 | 0.53 | 6.09 | ▇▇▂▁▁ |
| Usagecont | 0 | 1 | -0.05 | 1.01 | -2.37 | -1.02 | 0.06 | 0.94 | 1.29 | ▁▅▃▂▇ |
| CarbFlow | 0 | 1 | -0.03 | 1.07 | -2.02 | -1.36 | 0.53 | 0.73 | 1.46 | ▅▁▁▇▃ |
| Density | 0 | 1 | -0.02 | 1.14 | -9.22 | -0.64 | -0.41 | 1.14 | 1.58 | ▁▁▁▃▇ |
| MFR | 0 | 1 | -0.05 | 1.16 | -6.71 | -0.01 | 0.29 | 0.41 | 1.50 | ▁▁▁▁▇ |
| Balling | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.74 | -0.50 | 1.16 | 1.54 | ▁▇▃▁▇ |
| PressureVacuum | 0 | 1 | 0.08 | 1.02 | -2.07 | -0.67 | 0.04 | 0.74 | 2.84 | ▁▇▆▃▁ |
| OxygenFiller | 0 | 1 | 0.02 | 0.99 | -1.98 | -0.41 | 0.05 | 0.55 | 3.20 | ▅▇▇▂▁ |
| BowlSetpoint | 0 | 1 | 0.00 | 0.99 | -2.39 | -0.73 | 0.70 | 0.70 | 1.52 | ▂▂▃▇▁ |
| PressureSetpoint | 0 | 1 | 0.06 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
| AirPressurer | 0 | 1 | 0.01 | 1.03 | -1.40 | -0.53 | -0.18 | -0.01 | 3.60 | ▅▇▁▁▁ |
| AlchRel | 0 | 1 | 0.01 | 0.98 | -1.08 | -0.72 | -0.62 | 0.68 | 1.74 | ▇▁▁▁▃ |
| CarbRel | 0 | 1 | 0.02 | 0.99 | -2.14 | -0.75 | -0.26 | 0.97 | 2.24 | ▁▇▃▃▃ |
| BallingLvl | 0 | 1 | 0.00 | 1.02 | -2.36 | -0.77 | -0.66 | 1.19 | 1.58 | ▁▃▇▁▇ |
First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of incomplete cells in the dataset, we prefer this dropping observations with any incomplete columns.
The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot package. This was the most efficient way to gain insight on the related clusters of variables.
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )We compare the correlations to the cleaned pre-processed training data.
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )Our last step of preprocessing is to export the data sets. This need only but done once.
readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )The above data files have the following properties and relationships to the original raw files:
Pre-Processed Data Files
| filename | primary_key | response | numeric_columns | num_rows | file_source |
|---|---|---|---|---|---|
| sdata_train_pp.csv | id - consistent with sdata_test_pp | PH - unchanged | BoxCox, center, scaled, imputed | 2055 | StudentData.xlsx |
| sdata_test_pp.csv | id - consistent with sdata_train_pp | PH - unchanged | BoxCox, center, scaled, imputed | 512 | StudentData.xlsx |
| edata_test_pp.csv | id - standalone | PH - unchanged | BoxCox, center, scaled, imputed | 267 | StudentEvaluation.xlsx |
An important note about the id column. The id in the file edata_test_pp.csv is the row number of the same observation in StudentEvaluation.xlsx. But the same is not true for sdata_train_pp.csv and sdata_test_pp.csv because they are sampled from the source file StudentData.xlsx. The 10th row of sdata_test_pp.csv is not necessarily the 10th row of StudentData.xlsx rather one should use the id field as the corresponding row number of the original file.
We load the pre-processed datasets for training and testing in this section. We load the csv file for the training data set below and display the first few column characteristics from the file.
sdata_train_pp = read_csv("sdata_train_pp.csv",
col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
sdata_trainX_pp = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp = sdata_test_pp %>% select(-PH , -id)tuningFull = TRUE # Set to TRUE if running the lengthy tuning process for all modelsWe consider the same dataset and variable importance problem using gradient boosted trees. The gbm library and variable importance measures are run within caret training procedure. The gradient boosting approach relies on a weak learner in an additive manner that minimizes the given loss function.
set.seed(1027)
gbmControlFull <- trainControl(method = "repeatedcv",
number = 10, # for debug mode, set this to a low number (1)
repeats = 10 ,
selectionFunction = "best",
verboseIter = FALSE)
gbmControlLight <- trainControl(method = "repeatedcv",
number = 5, # for debug mode, set this to a low number (1)
repeats = 1 ,
selectionFunction = "best",
verboseIter = FALSE)
if(tuningFull == TRUE)
{
gbmControl = gbmControlFull
gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
shrinkage = c( 0.02, 0.05, 0.1) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 15
)
} else {
gbmControl = gbmControlLight
gbmTune = expand.grid( n.trees = c( 2000),
shrinkage = c( 0.02, 0.05 , 0.1 ) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 4
)
}
(gbmTune = caret::train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "gbm",
tuneGrid = gbmTune ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )## Stochastic Gradient Boosting
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.02 3 250 0.1434366 0.3445322 0.1130661
## 0.02 3 500 0.1399034 0.3781615 0.1093662
## 0.02 3 750 0.1381810 0.3948955 0.1075668
## 0.02 3 1000 0.1371433 0.4053153 0.1065292
## 0.02 7 250 0.1375662 0.3998266 0.1072462
## 0.02 7 500 0.1339808 0.4329093 0.1037283
## 0.02 7 750 0.1326817 0.4453548 0.1025127
## 0.02 7 1000 0.1320709 0.4515327 0.1019375
## 0.05 3 250 0.1391827 0.3846115 0.1085754
## 0.05 3 500 0.1370326 0.4073776 0.1063487
## 0.05 3 750 0.1364019 0.4163811 0.1056864
## 0.05 3 1000 0.1360931 0.4213869 0.1053766
## 0.05 7 250 0.1337931 0.4330732 0.1032574
## 0.05 7 500 0.1324935 0.4472241 0.1018857
## 0.05 7 750 0.1319768 0.4531566 0.1013501
## 0.05 7 1000 0.1317495 0.4562800 0.1010483
## 0.10 3 250 0.1379832 0.3995082 0.1072565
## 0.10 3 500 0.1371777 0.4135910 0.1063966
## 0.10 3 750 0.1368283 0.4201240 0.1059677
## 0.10 3 1000 0.1366484 0.4244918 0.1056588
## 0.10 7 250 0.1343586 0.4341016 0.1034324
## 0.10 7 500 0.1335822 0.4437819 0.1027288
## 0.10 7 750 0.1332925 0.4472134 0.1023291
## 0.10 7 1000 0.1332832 0.4482130 0.1022717
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 15
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 7, shrinkage = 0.05 and n.minobsinnode = 15.
As the model output indicates the greater tree depth outperforms the lower counts and the shrinkage value of 0.05 produces the best RMSE value. The shrinkage, or learning rate, typically works better with smaller values, yet require longer processing time.
## Saving 7 x 5 in image
The plot of the tuned model clearly shows the better performance with the increased max tree depth but also indicates the lower shrinkage value doesn’t guarantee a lower RMSE. For the max tree depth of 3 and 7, the lowest shrinkage value does not produce the lowest RMSE for that tree depth.
## Saving 7 x 5 in image
The variable importance plot shows the 100% importance of the MnfFlow followed by BrandCode as the only other variable with an importance above 50%. Usagecont, OxygenFiller, AlchRel, and Temperature round out the top 6 variables based on importance with scores greater than 25%.
Overall, the final Gradient Boosted model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.
Now we train a Cubist tree using the best selectionFunction on RMSE. The Cubist library and variable importance measures are run within caret training procedure.
The Cubist model is a rules-based approach in which the terminal leaves contains linear regression models. The models are based on predictions defined in the previous nodes of the tree.
set.seed(1027)
cubistControlFull <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridFull <- expand.grid( committees = c( 10, 50, 100 ) ,
neighbors = c( 0, 1, 5, 9 )
)
cubistControlLight <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) ,
neighbors = c( 0, 5 ) )
if(tuningFull == TRUE)
{
cubistControl = cubistControlFull
cubistGrid = tuneGridFull
} else {
cubistControl = cubistControlLight
cubistGrid = tuneGridLight
}
(cubistTune = caret::train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )## Cubist
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1849, 1849, 1851, 1850, 1849, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.10326126 0.6469501 0.07521436
## 10 1 0.10698108 0.6394174 0.07408043
## 10 5 0.09661241 0.6865169 0.06923965
## 10 9 0.09715807 0.6838488 0.06954861
## 50 0 0.10213076 0.6574062 0.07362497
## 50 1 0.10583349 0.6449193 0.07208654
## 50 5 0.09539318 0.6943587 0.06757449
## 50 9 0.09598916 0.6917813 0.06811918
## 100 0 0.10218813 0.6576706 0.07366847
## 100 1 0.10587327 0.6449503 0.07201655
## 100 5 0.09547621 0.6941947 0.06752479
## 100 9 0.09606335 0.6917983 0.06810842
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 50 and neighbors = 5.
The Cubist model results provide a strong RMSE value and \(R^2\) value with generally a higher number of committees and neighbors.
The line plot confirms the better model per RMSE with neighbors at 5 and committees set to 50.
## Saving 7 x 5 in image
The variable importance plot identifies MnfFlow as the most valuable variable followed by Balling, BallingLvl, AlchRel, and PressureVacuum to round out the top 5 variables. Overall, 7 variables resulted in over 50% importance indicating those variables are contained in over 50% of the rules defining the Cubist model.
## Saving 7 x 5 in image
The final Cubist model confirms the high usage rate of the variable MnfFlow.
Cubist Predictor Usage
| Conditions | Model | Variable |
|---|---|---|
| 82 | 61 | MnfFlow |
| 18 | 81 | Balling |
| 22 | 71 | BallingLvl |
| 19 | 73 | AlchRel |
| 28 | 60 | PressureVacuum |
| 19 | 63 | OxygenFiller |
| 14 | 66 | Density |
| 21 | 51 | BowlSetpoint |
| 8 | 61 | HydPressure3 |
| 7 | 62 | Temperature |
| 9 | 59 | AirPressurer |
| 18 | 47 | CarbFlow |
| 6 | 58 | CarbRel |
| 3 | 54 | HydPressure2 |
| 3 | 52 | CarbPressure1 |
| 11 | 40 | Usagecont |
| 16 | 31 | FillerSpeed |
| 6 | 39 | HydPressure1 |
| 41 | 0 | BrandCode |
| 3 | 33 | FillerLevel |
| 0 | 27 | CarbPressure |
| 3 | 20 | CarbVolume |
| 2 | 21 | PCVolume |
| 0 | 20 | CarbTemp |
| 0 | 20 | PressureSetpoint |
| 0 | 17 | FillPressure |
| 0 | 16 | HydPressure4 |
| 0 | 16 | PSCFill |
| 0 | 6 | MFR |
| 0 | 5 | FillOunces |
| 0 | 4 | PSC |
| 0 | 3 | PSCCO2 |
The splits plot indicates the usage of each variable by committee/rule. The values are normalized and highlight the high usage rate of MnfFlow along with the use of greater than or less than inequalities for MnfFlow.
## quartz_off_screen
## 2
The dotplot allows us to see the distribution of the coefficient parameters for each linear model associated with a committee/rule.
When a predictor has more dots, it is used in more models which suggests the variable is influential. The color density is proportional to the variable importance. The sign of the coefficient predicts the marginal effect of the predictor on the response PH. However, since Cubist uses many models, the concept of the sign of a predictor coefficient can be unclear.
The x-axis value of the dot in the dotplot indicates the true coefficient in the original units unlike the above plot which normalizes the x-axis.
## quartz_off_screen
## 2
The boxplot below which we devised constructs a distribution of coefficients for each predictor. We interpret the sign of the predictor \(X\) marginal effect to be the sign of the median of the distribution of coefficients pf \(X\) over those models which include \(X\). The predictors have been sorted by their median.
The plot provides clarity regarding the impact of a given variable as the median is a good estimate of the sign of the variable coefficient. The AlchRel variable box indicates a positive relationship with the dependent variable PH. The most important variable MnfFlow box results in a negative coefficient, and thus an inverse effect on PH. For all the rules in the final Cubist model, the understanding of the coefficient values for each variable provides a clear relationship to the dependent variable despite the obtuse nature of the Cubist model construction.
## Saving 7 x 5 in image
Overall, the final Cubist model results in an RMSE under 0.1 and a \(R^2\) just below of 0.7.
Using the rpart library and its training methods, we consider the best CART based on the RMSE metric. The CART approach follows a classification and regression tree methodology.
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 ) )
cartControl <- trainControl(method = "boot",
number = 30,
selectionFunction = "best")
(rpartTune = train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )## CART
##
## 2055 samples
## 32 predictor
##
## No pre-processing
## Resampling: Bootstrapped (30 reps)
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.1540934 0.2150028 0.1207013
## 2 0.1485972 0.2697600 0.1163125
## 3 0.1439126 0.3150026 0.1130389
## 4 0.1418976 0.3346403 0.1106321
## 5 0.1408007 0.3455111 0.1095038
## 6 0.1398046 0.3557469 0.1085050
## 7 0.1389994 0.3635874 0.1074293
## 8 0.1381955 0.3717156 0.1065335
## 9 0.1373228 0.3802125 0.1057161
## 10 0.1362384 0.3899605 0.1046780
## 11 0.1353910 0.3979319 0.1038336
## 12 0.1345705 0.4055253 0.1028965
## 13 0.1340562 0.4106769 0.1022712
## 14 0.1335638 0.4154690 0.1017284
## 15 0.1330891 0.4202569 0.1012114
## 16 0.1327996 0.4230057 0.1008957
## 17 0.1327006 0.4239743 0.1008136
## 18 0.1326532 0.4245372 0.1007342
## 19 0.1326119 0.4250877 0.1006977
## 20 0.1326119 0.4250877 0.1006977
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 19.
The model output shows increasing tree depth leads to improved RMSE and \(R^2\) values.
The below plot displays the improving RMSE value with the increasing tree depth with an optimal tree depth of 19.
## Saving 7 x 5 in image
Once again, the variable MnfFlow is the most important variable with an importance rating of 100%. Six variables achieve an importance score of greater than 50% with BrandCode, Temperature, and CarbRel each achieving greater than 75% importance.
## Saving 7 x 5 in image
The CART tree plot of the final model reaffirms the importance of the variable MnfFlow as the initial node splits on MnfFlow. One of the second level nodes is BrandCode, another highly important variable to the CART model. Eight ff the top 10 most important variables in the Cubist model are found in the top 15 variables of the CART model. This overlap of variable importance shows a consistency across the different model approaches.
Overall, the final CART model results in an RMSE of approximately 0.132 and an \(R^2\) less than 0.5.
Now we train a MARS model based on the implementation of earth from the caret library. The MARS approach is a non-linear regression model that does not rely on the original variables but instead on derived features. The tuned MARS model defines the derived features as linear combinations of the initial variables.
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
## Call: earth(x=tbl_df[2055,32], y=c(8.36,8.26,8.9...), keepxy=TRUE, degree=2,
## nprune=27)
##
## coefficients
## (Intercept) 8.3958776
## h(-0.215889-MnfFlow) 0.1780017
## h(HydPressure3- -1.29159) 0.0391037
## h(-0.728732-Temperature) -0.0805473
## h(Temperature- -0.728732) -0.0477016
## h(-0.658787-Balling) 0.3588441
## h(-0.832915-OxygenFiller) -0.0445311
## h(OxygenFiller- -0.832915) -0.0417267
## h(BowlSetpoint- -1.34947) 0.0724302
## h(0.640105-AlchRel) 0.0395901
## h(AlchRel-0.640105) 0.1293063
## h(1.05286-MnfFlow) * BrandCodeC -0.0902482
## h(MnfFlow-1.05286) * BrandCodeC -1.2433037
## h(-0.215889-MnfFlow) * h(PressureVacuum-0.386629) -0.0625259
## h(-0.215889-MnfFlow) * h(0.386629-PressureVacuum) -0.1069156
## h(-0.215889-MnfFlow) * h(AirPressurer-0.839882) -0.1043474
## h(-0.215889-MnfFlow) * h(0.839882-AirPressurer) 0.0605090
## h(MnfFlow- -0.215889) * h(0.640105-AlchRel) -0.0669114
## h(1.75812-CarbPressure1) * h(BowlSetpoint- -1.34947) -0.0120925
## h(0.1987-HydPressure1) * h(-0.658787-Balling) -0.2036569
## h(0.21892-FillerSpeed) * h(Balling- -0.658787) 0.0158849
## h(FillerSpeed-0.21892) * h(Balling- -0.658787) 0.2165359
## h(Temperature-1.27402) * h(BowlSetpoint- -1.34947) 0.0275573
## h(Usagecont-0.0816887) * h(Balling- -0.658787) -0.0381250
## h(Density-0.482509) * h(BowlSetpoint- -1.34947) -0.0578527
## h(0.386629-PressureVacuum) * h(0.640105-AlchRel) 0.0204706
## h(OxygenFiller- -0.832915) * h(AirPressurer- -0.700718) 0.0305074
##
## Selected 27 of 45 terms, and 15 of 35 predictors (nprune=27)
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: MnfFlow, BrandCodeC, AlchRel, AirPressurer, Balling, ...
## Number of terms at each degree of interaction: 1 10 16
## GCV 0.0141598 RSS 27.25932 GRSq 0.5260713 RSq 0.5555922
The plot of the final MARS model indicates increasing the number of terms leads to a lower RMSE.
## Saving 7 x 5 in image
Once again, the variable MnfFlow achieves an importance score of 100%. Only 3 variables reach the score of 50%, the aforementioned MnfFlow, BrandCodeC, and AlchRel.
The variable MnfFlow is influential because the variable effects many other variables in the final model. Due to the numerous linear combinations including MnfFlow, the relationship to PH is difficult to ascertain. The MnfFlow variable requires more investigation into the variable interactions to determine the impact to PH for the MARS model.
## Saving 7 x 5 in image
Overall, the MARS model reaffirms the importance of the variable MnfFlow and also results in an RMSE and \(R^2\) in line with the previous models for Gradient Boosted Trees and CART.
We also run a simple OLS regression on PH in terms of the scaled and centered variables. While its predictive performance is weaker than the other models on the training set, the coefficients are much easier to interpret.
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51932 -0.07897 0.00917 0.08626 0.44987
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.586130 0.010428 823.364 < 2e-16 ***
## CarbVolume -0.005488 0.008389 -0.654 0.513047
## FillOunces -0.004105 0.003163 -1.298 0.194451
## PCVolume -0.006448 0.003678 -1.753 0.079740 .
## CarbPressure -0.003088 0.011767 -0.262 0.793033
## CarbTemp 0.008475 0.010712 0.791 0.428936
## PSC -0.003168 0.003147 -1.007 0.314160
## PSCFill -0.005762 0.003066 -1.879 0.060361 .
## PSCCO2 -0.006170 0.003020 -2.043 0.041221 *
## MnfFlow -0.089502 0.006520 -13.728 < 2e-16 ***
## CarbPressure1 0.030465 0.003680 8.278 2.25e-16 ***
## FillPressure 0.005601 0.004501 1.244 0.213528
## HydPressure1 0.004482 0.005177 0.866 0.386658
## HydPressure2 -0.027126 0.009963 -2.723 0.006533 **
## HydPressure3 0.059090 0.010506 5.624 2.12e-08 ***
## HydPressure4 0.001523 0.004624 0.329 0.741982
## FillerLevel -0.021768 0.008733 -2.493 0.012760 *
## FillerSpeed 0.003863 0.004923 0.785 0.432809
## Temperature -0.019234 0.003564 -5.397 7.59e-08 ***
## Usagecont -0.020299 0.003911 -5.191 2.31e-07 ***
## CarbFlow 0.012499 0.004515 2.768 0.005685 **
## Density -0.018774 0.009023 -2.081 0.037598 *
## MFR 0.002030 0.003898 0.521 0.602651
## Balling -0.091337 0.014856 -6.148 9.43e-10 ***
## PressureVacuum -0.016446 0.004588 -3.585 0.000345 ***
## OxygenFiller -0.015182 0.004217 -3.600 0.000326 ***
## BowlSetpoint 0.053190 0.009135 5.823 6.72e-09 ***
## PressureSetpoint -0.013892 0.004500 -3.087 0.002047 **
## AirPressurer -0.002649 0.003204 -0.827 0.408420
## AlchRel 0.034578 0.011814 2.927 0.003463 **
## CarbRel 0.005015 0.006593 0.761 0.447003
## BallingLvl 0.097856 0.017857 5.480 4.79e-08 ***
## BrandCodeA -0.095705 0.025430 -3.764 0.000172 ***
## BrandCodeC -0.139671 0.010233 -13.649 < 2e-16 ***
## BrandCodeD -0.043954 0.029625 -1.484 0.138054
## BrandCodeE -0.057782 0.015435 -3.744 0.000186 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1325 on 2019 degrees of freedom
## Multiple R-squared: 0.4223, Adjusted R-squared: 0.4123
## F-statistic: 42.16 on 35 and 2019 DF, p-value: < 2.2e-16
The most significant variables are MnfFlow, CarbPressure, HydPressure3, Temperature, Usagecont, Balling, PressureVacuum, OxygenFiller, BowlSetpoint, BallingLvl, and BrandCode. Seven of these 11 variables are found in the top 10 most important variables of the Cubist model. Upon closer evaluation, the median value of the variable coefficients from the top 10 of the Cubist model match the sign of the coefficient value in the linear regression model. This finding further supports the assessment of the Cubist model variable impacts on the dependent variable PH.
We summarize the plot of coefficients below.
## NULL
The diagnostic plots of the OLS model suggest that the histogram of residuals is relatively symmetric and looks normal.
The QQ plot suggests the model fit does not fully capture the fat tails possibly due to a small number of outliers.
Overall, the linear regression model does not perform as well as the Cubist model but does provide a valuable sanity check and baseline to the results from the more advanced models.
Based on the above results, we summarize their training and test RMSE and \(R^2\) below.
The joint results are tabulated below.
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)
model_stats = data.frame()
for( modelObj in model_list )
{
if( modelObj$method != 'lm' ){
pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
}
else
{
pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
}
output <- data.frame( modelName = modelObj$method ,
trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
testR2 = caret::R2( pred_data, sdata_test_pp$PH)
)
model_stats <- rbind(model_stats, output)
}Model Results Comparison
| modelName | trainRMSE | trainR2 | testRMSE | testR2 |
|---|---|---|---|---|
| cubist | 0.095 | 0.694 | 0.090 | 0.722 |
| rpart2 | 0.133 | 0.425 | 0.125 | 0.474 |
| lm | 0.134 | 0.396 | 0.127 | 0.449 |
| gbm | 0.132 | 0.456 | 0.151 | 0.342 |
| earth | 0.125 | 0.486 | 0.153 | 0.279 |
## save_kable will have the best result with magick installed.
We also inspect the plot of the predicted versus the observed data for the 5 models. The scatterplots below confirm the Cubist model outperforms the others on both the training set in the upper row and also the test data on the lower rows. The distribution of the Cubist plot mostly closely aligns with the 45-degree line which corresponds to an accurate prediction.
We conclude that the best model is Cubist but the multilinear regression model, MARS and CART models provide useful insight on the main predictors.
The top 5 variables that impact PH are:
MnfFlow - this appears negatively related to PH level based on OLS, MARS and Cubist median plot.
BallingLvl is positively related to PH level.
PreserveVacuum is negatively related to PH.
AlchRel is positively related to PH.
Balling is negatively related to PH.
Lastly, we observe that Brand is probably not a manufacturing predictor that can be controlled. Mostly likely Brand refers to a recipe for the final product. However, knowing the desired optimal PH for each Brand may help us control the most influential variables that control PH, thereby providing better quality control. While there are over 30 predictors in the dataset, we recommend looking at the top 10 most influential predictors. Most of the models agree substantially on which predictors to include in the top 10.
For example, the linear regression model coefficients agree with the sign and variable importance of the top 10 predictors of the Cubist model. Importance predictors of the OLS model have the highest statistical significance - consistent with variable importance measure of caret.
## Cubist
##
## 2567 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2310, 2309, 2310, 2310, 2310, 2311, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 10 0 0.09786899 0.6814835 0.07063050
## 10 1 0.10347326 0.6605109 0.07157433
## 10 5 0.09205621 0.7148932 0.06565672
## 10 9 0.09275054 0.7104184 0.06629558
## 50 0 0.09733927 0.6877932 0.07034005
## 50 1 0.10284267 0.6632483 0.07070141
## 50 5 0.09105650 0.7209833 0.06455347
## 50 9 0.09179098 0.7169196 0.06532125
## 100 0 0.09682465 0.6925441 0.07016957
## 100 1 0.10246532 0.6656340 0.07043498
## 100 5 0.09042487 0.7248834 0.06419195
## 100 9 0.09121584 0.7206745 0.06503557
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 5.
RMSE of the full StudentData dataset: 0.0904249.
\(R^2\) of the full StudentData dataset: 0.7248834.
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
# Conditional Code Evaluation is managed here.
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rpart.plot))
suppressPackageStartupMessages(library(partykit))
suppressPackageStartupMessages(library(gbm))
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(Cubist)
library(rpart)
library(party)
library(writexl)
.code-bg {
background-color: lightgray;
}
raw_StudentData <- read_excel("StudentData.xlsx")
dim(raw_StudentData)
skim(raw_StudentData) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentData) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentData %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)
# include any rows where any column is NA. Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
dim(missing_train_rows)
missing_train_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary
missing_train_summary %>% group_by(num_na) %>%
summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells = sum( missing_train_summary$num_na)
num_cells = nrow(sdata_v1) * 32 # excludes the response variable PH since we dropped its values.
print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentEvaluation) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentEvaluation %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>% as_tibble() -> edata_v1
dim(edata_v1)
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
dim(missing_test_rows)
missing_test_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary
missing_test_summary %>% group_by(num_na) %>%
summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells_test = sum( missing_test_summary$num_na)
num_cells_test = nrow(edata_v1) * 32 # excludes the response variable PH since we dropped its values.
print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " ,
sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )
skim_sdata = skim(sdata_v1)
skim_edata = skim(edata_v1)
skim_sdata %>% select( skim_variable, n_missing, complete_rate) %>% inner_join(skim_edata %>% select(skim_variable, n_missing, complete_rate), by = c("skim_variable") ) %>% filter(skim_variable != 'PH' ) -> skim_complete_rates
skim_complete_rates %>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>% ggplot(aes(x=training_complete, y = testing_complete)) +
geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
labs(title = "Completion Rate by Predictor between Training and Test Sets")
# Feature plot for the numeric predictor variables against the result variable PH
cols <- sdata_v1 %>%
select(-c('id', 'BrandCode', 'PH')) %>% colnames()
featurePlot(sdata_v1[,cols],
sdata_v1$PH,
plot="scatter",
type = c("p", "smooth"),
span = .5,
layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
data_long <- sdata_v1[,cols] %>%
pivot_longer(cols) %>%
as.data.frame()
ggp2 <- ggplot(data_long, aes(x = value)) +
geom_density() +
facet_wrap(~ name, scales="free") +
labs(title = "Density Plot of Predictor Variables")
ggp2
# http://www.sthda.com/english/wiki/normality-test-in-r
# From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
shap_test_res <- lapply(sdata_v1[,cols], shapiro.test)
# Result: None of the numeric variables are normally distributed.
# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
subset_vector <- sapply(shap_test_res, function(x) x$p.value > .05)
results_subset <- shap_test_res[subset_vector]
length(results_subset)
ggplot(data = sdata_v1) +
geom_bar(mapping = aes(x = `BrandCode`)) +
labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
geom_boxplot() +
labs(title = "Boxplot of Brand Code")
sdata_v1_no_nas <- sdata_v1 %>% drop_na()
bev.vsurf <- VSURF(sdata_v1_no_nas[,cols],
sdata_v1_no_nas$PH,
ntree = 10,
nfor.thres = 20,
nfor.interp = 10, nfor.pred = 10)
bev.vsurf
summary(bev.vsurf)
bev.vsurf$varselect.pred
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
sdata_test = sdata_v1[-train_indices , ]
sdata_train = sdata_v1[ train_indices , ]
sdataY_test = sdata_v1$PH[-train_indices ]
sdataY_train = sdata_v1$PH[ train_indices ]
nearZeroVar(sdata_train, saveMetrics = TRUE) %>%
kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(sdata_train) %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
arrange( desc(zscore_extreme)) %>%
dplyr::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
kable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
pl1 = sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
plotlist= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plot_grid(plotlist=plotlist, nrow = 2 )
median_MFR = median( sdata_train$MFR , na.rm = TRUE )
sdata_train %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_train
sdata_test %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_test
edata_v1 %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> edata_v2
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc, as.data.frame(sdata_v2_train ) )
# Becomes the final version of test data for validation
sdata_test_pp = predict( preProcFunc, as.data.frame(sdata_v2_test) )
# Need to generate predictions based on test data without known response
edata_test_pp = predict( preProcFunc, as.data.frame(edata_v2) )
sdata_trainX_pp = sdata_train_pp %>% select( -PH)
sdata_testX_pp = sdata_test_pp %>% select(-PH)
edata_testX_pp = edata_test_pp %>% select( -PH)
skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )
file_df = tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
num_rows = c(2055, 512, 267 ) ,
file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)
file_df %>% kable(caption = "Pre-Processed Data Files") %>% kable_styling(bootstrap_options = c("hover", "striped"
),
position = "left")
sdata_train_pp = read_csv("sdata_train_pp.csv",
col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
sdata_trainX_pp = sdata_train_pp %>% select( -PH, -id )
sdata_testX_pp = sdata_test_pp %>% select(-PH , -id)
tuningFull = TRUE # Set to TRUE if running the lengthy tuning process for all models
set.seed(1027)
gbmControlFull <- trainControl(method = "repeatedcv",
number = 10, # for debug mode, set this to a low number (1)
repeats = 10 ,
selectionFunction = "best",
verboseIter = FALSE)
gbmControlLight <- trainControl(method = "repeatedcv",
number = 5, # for debug mode, set this to a low number (1)
repeats = 1 ,
selectionFunction = "best",
verboseIter = FALSE)
if(tuningFull == TRUE)
{
gbmControl = gbmControlFull
gbmTune = expand.grid( n.trees = seq(250, 1000, by = 250),
shrinkage = c( 0.02, 0.05, 0.1) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 15
)
} else {
gbmControl = gbmControlLight
gbmTune = expand.grid( n.trees = c( 2000),
shrinkage = c( 0.02, 0.05 , 0.1 ) ,
interaction.depth = c( 3, 7) ,
n.minobsinnode = 4
)
}
(gbmTune = caret::train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "gbm",
tuneGrid = gbmTune ,
verbose = FALSE,
metric = "RMSE" ,
trControl = gbmControl ) )
gbm_plot <- ggplot(gbmTune) + labs(title = "Gradient Boosted Trees - Tuning")
gbm_plot
ggsave("GradientBoostedTreesTurning.jpg", gbm_plot)
gbm_vi_plot <- ggplot(varImp(gbmTune, numTrees = gbmTune$finalModel$n.trees) ) + labs(title = "Gradient Boosted Trees - Variance Importance")
gbm_vi_plot
ggsave("GbmVarImp.jpg", gbm_vi_plot)
set.seed(1027)
cubistControlFull <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridFull <- expand.grid( committees = c( 10, 50, 100 ) ,
neighbors = c( 0, 1, 5, 9 )
)
cubistControlLight <- trainControl(method = "cv" , selectionFunction = "best")
tuneGridLight <- expand.grid( committees = c( 10, 20 ) ,
neighbors = c( 0, 5 ) )
if(tuningFull == TRUE)
{
cubistControl = cubistControlFull
cubistGrid = tuneGridFull
} else {
cubistControl = cubistControlLight
cubistGrid = tuneGridLight
}
(cubistTune = caret::train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
cubist_plot <- ggplot(cubistTune) + labs(title="Cubist Model Tuning")
cubist_plot
ggsave("CubistTuning.jpg", cubist_plot)
cubist_vi_plot <- ggplot(varImp(cubistTune), top = 20) + labs(title = "Cubist Model Variable Importance")
cubist_vi_plot
ggsave("CubistVarImp.jpg", cubist_vi_plot)
cubistTune$finalModel$usage %>% arrange( desc(Model+ Conditions)) %>%
kable(caption = "Cubist Predictor Usage") %>% kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
cubist_splitsplot <- dotplot( cubistTune$finalModel, what = "splits" )
cubist_splitsplot
# Output to file
jpeg(file="CubistSplitsPlot.jpg")
cubist_splitsplot
dev.off()
cubist_dotplot <- dotplot( cubistTune$finalModel,
what = "coefs",
between = list(x = 0.2, y = 0.5) ,
scales = list(x = list(relation = "free"),
y = list(cex = 0.25) ) )
cubist_dotplot
jpeg(file="CubistDotPlot.jpg")
cubist_splitsplot
dev.off()
cubistTune$finalModel$coefficients %>%
rownames_to_column(var="id") %>%
select(-committee, -rule) %>%
pivot_longer(!id, names_to = "predictor", values_to = "coef" ) %>%
filter( !is.na(coef)) %>%
filter( predictor != '(Intercept)' ) %>%
filter(abs(coef) < 20 ) -> coef_piv
cubist_boxplot <- coef_piv %>%
ggplot() +
geom_boxplot(
aes(x=reorder(predictor, coef, FUN = median, na.rm = TRUE ), # order the boxplots by median
y = coef ), outlier.shape = NA ) + # strip out the outliers
coord_flip(ylim=c(-.5,.5)) +
labs(title = "Distribution of Coefficients of Predictors",
subtitle = "from Cubist Committee/Rules",
x = "Predictors" ,
y = "Coefficient in Linear Model"
)
cubist_boxplot
ggsave("CubistBoxPlot.jpg", cubist_boxplot)
set.seed(10393)
library(rpart)
cartGrid = expand.grid( maxdepth = seq( 1, 20, by = 1 ) )
cartControl <- trainControl(method = "boot",
number = 30,
selectionFunction = "best")
(rpartTune = train( x = sdata_trainX_pp,
y = sdata_train_pp$PH ,
method = "rpart2" ,
metric = "RMSE" ,
tuneGrid = cartGrid ,
trControl = cartControl ) )
cart_tuning_plot <- ggplot(rpartTune) + labs(title="CART Tree Tuning")
cart_tuning_plot
ggsave("CartTuningPlot.jpg", cart_tuning_plot)
cart_vi_plot <- ggplot(varImp( rpartTune), top = 20 ) + labs(title = "CART Tree Variable Importance")
cart_vi_plot
ggsave("CartVarImp.jpg", cart_vi_plot)
rpart.plot(rpartTune$finalModel, digits = 3)
if( tuningFull == TRUE )
{
marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,27, by=5) )
marsControl = trainControl(method = "cv",
number = 10,
selectionFunction = "best",
verboseIter = FALSE)
} else {
marsGrid = expand.grid( .degree = 1:2, .nprune = seq(2,25, by = 10 ) )
marsControl = trainControl(method = "cv",
number = 10,
selectionFunction = "best",
verboseIter = FALSE)
}
set.seed(1000)
marsTuned = caret::train(x = sdata_trainX_pp ,
y = sdata_train_pp$PH,
method = "earth" ,
metric = "RMSE" ,
tuneGrid = marsGrid ,
trControl = marsControl )
marsTuned$finalModel
summary(marsTuned)
mars_tuning_plot <- ggplot(marsTuned) + labs(title="MARS Model Tuning" )
mars_tuning_plot
ggsave("MarsTuningPlot.jpg", mars_tuning_plot)
mars_vi_plot <- ggplot( varImp(marsTuned), 20 ) + labs(title = "MARS Model Variable Importance")
mars_vi_plot
ggsave("MarsVarImp.jpg", mars_vi_plot)
set.seed(123)
train.control <- trainControl( method = "cv", number = 5)
linearTune = train( PH ~ . -id,
data = sdata_train_pp,
method = "lm",
tuneGrid = data.frame(intercept = TRUE),
trControl = train.control)
summary(linearTune$finalModel)
par(mfrow=c(2,2))
lm_final_model_plot <- plot(linearTune$finalModel)
lm_final_model_plot
#ggsave("LinearModelFinalPlot.jpg", lm_final_model_plot)
hist(linearTune$finalModel$residuals)
# Make a list of model results collect them from caret for both training and test data and compare them jointly in a kable()
model_list = list( rpartTune , marsTuned , gbmTune, cubistTune, linearTune)
model_stats = data.frame()
for( modelObj in model_list )
{
if( modelObj$method != 'lm' ){
pred_data <- as.numeric( predict( modelObj, newdata=sdata_testX_pp ))[]
}
else
{
pred_data <- as.numeric( predict( modelObj, newdata=sdata_test_pp ))
}
output <- data.frame( modelName = modelObj$method ,
trainRMSE = getTrainPerf(modelObj)[1,1] ,
trainR2 = getTrainPerf(modelObj)[1,2] ,
testRMSE = caret::RMSE( pred_data, sdata_test_pp$PH) ,
testR2 = caret::R2( pred_data, sdata_test_pp$PH)
)
model_stats <- rbind(model_stats, output)
}
results_table <- model_stats %>% as_tibble() %>%
arrange( testRMSE) %>%
kable( digits = 3, caption = "Model Results Comparison") %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
results_table
save_kable(results_table, "ModelResultsTable.jpg")
predVals = extractPrediction(list(marsTuned, rpartTune, cubistTune, gbmTune), testX = sdata_testX_pp , testY = sdata_test_pp$PH )
# Attempting to hack into the extract prediction object with the LM data
# Cols: obs pred model dataType object
pred_lm = predict(linearTune, newdata = sdata_test_pp)
pred_lm_train <- linearTune$finalModel$fitted.values
# Create placeholder data for the training values
model_l <- rep("LM", length = 2055)
dt_l <- rep("Training", length = 2055)
obj_l <- rep("Object9", length = 2055)
# Create dataframe of the training data of LM
lm_train_preds <- as.data.frame(cbind(obs=sdata_train_pp$PH, pred=pred_lm_train, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)
# Create placeholder data for the test values
model_l <- rep("LM", length = 512)
dt_l <- rep("Test", length = 512)
obj_l <- rep("Object10", length = 512)
# Create dataframe of the test data of LM
lm_test_preds <- as.data.frame(cbind(obs=sdata_test_pp$PH, pred=pred_lm, model=model_l, dataType=dt_l, object=obj_l), row.names = NULL)
# Combine all data for plotting
all_preds <- rbind(predVals, lm_train_preds, lm_test_preds)
# Ensure obs and preds are numerics
all_preds$obs <- as.numeric(all_preds$obs)
all_preds$pred <- as.numeric(all_preds$pred)
# Plot using Caret built-in reporting.
# --------------------------------------
obsVsPredsPlot <- plotObsVsPred(all_preds)
obsVsPredsPlot
#ggsave("ObsVsPredsPlot.jpg", obsVsPredsPlot)
# Combine the initial training and test datasets into 1
sdata_trainfull_pp <- rbind(sdata_train_pp, sdata_test_pp)
sdata_trainfullX_pp <- sdata_trainfull_pp %>% select(-PH , -id)
edata_test_pp = read_csv("edata_test_pp.csv",
col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>%
as_tibble() %>% remove_rownames()
# Train the Cubist model
set.seed(1027)
if(tuningFull == TRUE)
{
cubistControl = cubistControlFull
cubistGrid = tuneGridFull
} else {
cubistControl = cubistControlLight
cubistGrid = tuneGridLight
}
(cubistTune = caret::train( x = sdata_trainfullX_pp,
y = sdata_trainfull_pp$PH ,
method = "cubist",
tuneGrid = cubistGrid ,
verbose = FALSE,
metric = "RMSE" ,
trControl = cubistControl ) )
# Make predictions on the provided dataset
final_pred_data <- as.numeric( predict( cubistTune, newdata=edata_test_pp ))
# Output prediction performance
full_cubist_train_rmse <- getTrainPerf(cubistTune)[1,1]
full_cubist_train_r2 <- getTrainPerf(cubistTune)[1,2]
# Output predictions to Excel file
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
raw_StudentEvaluation$PH <- final_pred_data
write_xlsx(raw_StudentEvaluation, "ANg_PTanofsky_Ph_Predictions.xlsx")