DATA 624 - PROJECT 2
1 Introduction
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.
2 Load Packages
The following R packages are used in this project.
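A representative package-loading chunk is sketched below; the exact list in the original .Rmd may differ slightly, but these are the packages whose functions appear in the code that follows.
library(tidyverse)  # dplyr/magrittr pipelines used throughout
library(rio)        # import() for reading the .xlsx files from GitHub
library(skimr)      # skim() data summaries
library(VIM)        # kNN() imputation
library(caret)      # train(), trainControl(), findCorrelation(), postResample()
library(recipes)    # preprocessing recipe steps
library(rsample)    # initial_split(), training(), testing()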
3 Load Data
Two data sets are downloaded from GitHub:
- Training Data: StudentData.xlsx
- Evaluation Data: StudentEvaluation.xlsx
df<-rio::import('https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project2/StudentData.xlsx')
df_eval <-rio::import('https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project2/StudentEvaluation.xlsx')
df<-data.frame(df)
df_eval<-data.frame(df_eval)
head(df)
4 Exploratory data analysis
According to the data summary below:
- The response variable [PH] is continuous, therefore a regression model is to be built.
- There are 31 numerical predictors and 1 categorical predictor in the data set.
- Only about 1% of the data are missing overall. The predictor containing the most missing values is [MFR], with a missing ratio of 212/2571 = 8.25% (a quick numerical check follows this list). Therefore no predictor is suggested to be removed, and imputation is included in the later data preprocessing.
- There are 4 rows in the training set in which [PH] is missing. As imputing the response variable is not meaningful in the training set, these 4 rows are removed.
- The majority of the continuous numerical predictors in both the training and evaluation sets show skewed distributions, and some predictors contain negative values, therefore the Yeo-Johnson transformation is used to remove the skewness.
- Dummy variables will be created for the categorical predictor [Brand.Code].
- After missing value imputation, the predictors [Balling], [Hyd.Pressure3], [Density], [Balling.Lvl] and [Filler.Level] have pairwise correlations greater than 0.9 with other predictors, therefore they are suggested to be removed to avoid multicollinearity.
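A quick numerical check of the missingness figures above (a sketch, assuming df has been loaded as in Section 3):
mean(is.na(df))                # overall proportion of missing cells, roughly 1%
sum(is.na(df$MFR))             # 212 missing values in MFR
sum(is.na(df$MFR)) / nrow(df)  # 212/2571 = 8.25%
sum(is.na(df$PH))              # 4 rows with missing PH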
4.1 Training Data Summary
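The training data summary below is produced with the skimr package; the call is sketched here for completeness, assuming skimr is loaded.
skim(df)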
| Name | df |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb.Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC.CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf.Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill.Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd.Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd.Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler.Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler.Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb.Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch.Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb.Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling.Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
4.2 Evaluation Data Summary
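As above, the evaluation data summary below is produced with skimr (sketched call):
skim(df_eval)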
| Name | df_eval |
| Number of rows | 267 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| PH | 267 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 1 | 1.00 | 5.37 | 0.11 | 5.15 | 5.29 | 5.34 | 5.47 | 5.67 | ▂▇▃▅▁ |
| Fill.Ounces | 6 | 0.98 | 23.97 | 0.08 | 23.75 | 23.92 | 23.97 | 24.01 | 24.20 | ▁▅▇▃▁ |
| PC.Volume | 4 | 0.99 | 0.28 | 0.06 | 0.10 | 0.23 | 0.28 | 0.32 | 0.46 | ▁▆▇▅▁ |
| Carb.Pressure | 0 | 1.00 | 68.25 | 3.86 | 60.20 | 65.30 | 68.00 | 70.60 | 77.60 | ▃▆▇▃▂ |
| Carb.Temp | 1 | 1.00 | 141.23 | 4.30 | 130.00 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▆▇▃▁ |
| PSC | 5 | 0.98 | 0.09 | 0.05 | 0.00 | 0.04 | 0.08 | 0.11 | 0.25 | ▆▇▃▂▁ |
| PSC.Fill | 3 | 0.99 | 0.19 | 0.11 | 0.02 | 0.10 | 0.18 | 0.26 | 0.62 | ▇▇▃▁▁ |
| PSC.CO2 | 5 | 0.98 | 0.05 | 0.04 | 0.00 | 0.02 | 0.04 | 0.06 | 0.24 | ▇▃▂▁▁ |
| Mnf.Flow | 0 | 1.00 | 21.03 | 117.76 | -100.20 | -100.00 | 0.20 | 141.30 | 220.40 | ▇▁▁▆▂ |
| Carb.Pressure1 | 4 | 0.99 | 123.04 | 4.42 | 113.00 | 120.20 | 123.40 | 125.50 | 136.00 | ▃▃▇▂▁ |
| Fill.Pressure | 2 | 0.99 | 48.14 | 3.44 | 37.80 | 46.00 | 47.80 | 50.20 | 60.20 | ▁▇▇▂▁ |
| Hyd.Pressure1 | 0 | 1.00 | 12.01 | 13.53 | -50.00 | 0.00 | 10.40 | 20.40 | 50.00 | ▁▁▇▆▂ |
| Hyd.Pressure2 | 1 | 1.00 | 20.11 | 17.21 | -50.00 | 0.00 | 26.80 | 34.80 | 61.40 | ▁▁▆▇▁ |
| Hyd.Pressure3 | 1 | 1.00 | 19.61 | 16.56 | -50.00 | 0.00 | 27.70 | 33.00 | 49.20 | ▁▁▆▃▇ |
| Hyd.Pressure4 | 4 | 0.99 | 97.84 | 13.92 | 68.00 | 90.00 | 98.00 | 104.00 | 140.00 | ▅▆▇▂▁ |
| Filler.Level | 2 | 0.99 | 110.29 | 15.50 | 69.20 | 100.60 | 118.60 | 120.20 | 153.20 | ▂▃▇▇▁ |
| Filler.Speed | 10 | 0.96 | 3581.39 | 911.19 | 1006.00 | 3812.00 | 3978.00 | 3996.00 | 4020.00 | ▁▁▁▁▇ |
| Temperature | 2 | 0.99 | 66.23 | 1.69 | 63.80 | 65.40 | 65.80 | 66.60 | 75.40 | ▇▅▁▁▁ |
| Usage.cont | 2 | 0.99 | 20.90 | 3.00 | 12.90 | 18.12 | 21.44 | 23.74 | 24.60 | ▁▃▃▃▇ |
| Carb.Flow | 0 | 1.00 | 2408.64 | 1161.36 | 0.00 | 1083.00 | 3038.00 | 3215.00 | 3858.00 | ▂▃▁▆▇ |
| Density | 1 | 1.00 | 1.18 | 0.38 | 0.06 | 0.92 | 0.98 | 1.60 | 1.84 | ▁▁▇▁▅ |
| MFR | 31 | 0.88 | 697.80 | 96.40 | 15.60 | 707.00 | 724.60 | 731.45 | 784.80 | ▁▁▁▁▇ |
| Balling | 1 | 1.00 | 2.20 | 0.92 | 0.90 | 1.50 | 1.65 | 3.24 | 3.79 | ▅▇▁▂▅ |
| Pressure.Vacuum | 1 | 1.00 | -5.17 | 0.58 | -6.40 | -5.60 | -5.20 | -4.80 | -3.60 | ▁▇▆▃▁ |
| Oxygen.Filler | 3 | 0.99 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.05 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 1 | 1.00 | 109.62 | 15.02 | 70.00 | 100.00 | 120.00 | 120.00 | 130.00 | ▁▂▁▃▇ |
| Pressure.Setpoint | 2 | 0.99 | 47.73 | 2.06 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 1 | 1.00 | 142.83 | 1.23 | 141.20 | 142.20 | 142.60 | 142.80 | 147.20 | ▅▇▁▁▁ |
| Alch.Rel | 3 | 0.99 | 6.91 | 0.50 | 6.40 | 6.54 | 6.58 | 7.18 | 7.82 | ▇▁▂▁▃ |
| Carb.Rel | 2 | 0.99 | 5.44 | 0.13 | 5.18 | 5.34 | 5.40 | 5.56 | 5.74 | ▂▇▂▃▂ |
| Balling.Lvl | 0 | 1.00 | 2.05 | 0.88 | 0.00 | 1.38 | 1.48 | 3.08 | 3.42 | ▁▃▇▁▇ |
4.4 Numerical Predictor Correlation after Missing Data Imputation
- Use KNN to impute missing values in the training data set.
- Compute pairwise correlations and locate the predictors with pairwise correlation greater than 0.9.
findCorrelation(df %>%
kNN() %>%
select(!ends_with('imp'), -c(Brand.Code, PH)) %>%
cor(),
cutoff = 0.9,
names = TRUE,
verbose = TRUE)
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx,
## numericalX, : NAs introduced by coercion
## Compare row 23 and column 21 with corr 0.955
## Means: 0.248 vs 0.154 so flagging column 23
## Compare row 14 and column 13 with corr 0.925
## Means: 0.246 vs 0.147 so flagging column 14
## Compare row 21 and column 31 with corr 0.948
## Means: 0.21 vs 0.141 so flagging column 21
## Compare row 31 and column 29 with corr 0.921
## Means: 0.18 vs 0.136 so flagging column 31
## Compare row 16 and column 26 with corr 0.946
## Means: 0.189 vs 0.133 so flagging column 16
## All correlations <= 0.9
## [1] "Balling" "Hyd.Pressure3" "Density" "Balling.Lvl"
## [5] "Filler.Level"
5 Data Preprocess
For the training set:
- Remove rows where PH is missing (NA).
- Perform a train/test split with ratio 4/5.
For both the training and evaluation sets:
- Impute missing values using bagged trees
- Create dummy variables for categorical variables
- Center and scale numerical variables
- Remove skewness of numerical variables
- Remove predictors with near-zero variance
- Remove predictors with correlation greater than 0.9
Note: Data preprocessing can be performed during model training; however, as there are multiple models to be built in the later section, preprocessing the data in advance is more efficient than doing it during each model run.
set.seed(0)
# -- remove is.na(PH)
df <- df %>%
filter(!is.na(PH))
# -- data preprocess
data_prepProc <- recipe(PH ~ ., df) %>%
#Impute missing value
step_bagimpute(all_predictors()) %>%
# create dummy variable for categorical variables
step_dummy(all_nominal(), -all_outcomes()) %>%
# center and scale
step_normalize(all_numeric(), -all_outcomes()) %>%
# remove skewness
step_YeoJohnson(all_nominal(), -all_outcomes()) %>%
# remove near zero variance predictors
step_nzv(all_nominal(), -all_outcomes()) %>%
# remove predictors with correlation > 0.9
step_corr(all_numeric(), -all_outcomes()) %>%
prep()
df_mod <- data_prepProc %>%
bake(df)
df_eval_mod <- data_prepProc %>%
bake(df_eval)
# train-test-split
df_split <- df_mod %>% initial_split(prop = 4/5)
# Training set
data_train_X <- training(df_split) %>% select(-PH)
data_train_Y <- training(df_split) %>% .$PH
# Testing set
data_test_X <- testing(df_split) %>% select(-PH)
data_test_Y <- testing(df_split) %>% .$PH
# Evaluation Set
data_eval_X <- df_eval_mod %>% select(-PH)
skim(df_mod)
| Name | df_mod |
| Number of rows | 2567 |
| Number of columns | 29 |
| _______________________ | |
| Column type frequency: | |
| numeric | 29 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 0 | 1 | 0.00 | 1.00 | -3.11 | -0.72 | -0.22 | 0.81 | 3.10 | ▁▆▇▅▁ |
| Fill.Ounces | 0 | 1 | 0.00 | 1.00 | -3.93 | -0.63 | -0.02 | 0.60 | 3.97 | ▁▂▇▂▁ |
| PC.Volume | 0 | 1 | 0.00 | 1.00 | -3.28 | -0.63 | -0.10 | 0.58 | 3.31 | ▁▃▇▂▁ |
| Carb.Pressure | 0 | 1 | 0.00 | 1.00 | -3.16 | -0.74 | 0.00 | 0.67 | 3.15 | ▁▅▇▃▁ |
| Carb.Temp | 0 | 1 | 0.00 | 1.00 | -3.08 | -0.67 | -0.08 | 0.66 | 3.17 | ▁▅▇▃▁ |
| PSC | 0 | 1 | 0.00 | 1.00 | -1.69 | -0.71 | -0.14 | 0.56 | 3.78 | ▆▇▃▁▁ |
| PSC.Fill | 0 | 1 | 0.00 | 1.00 | -1.67 | -0.81 | -0.13 | 0.55 | 3.62 | ▆▇▃▁▁ |
| PSC.CO2 | 0 | 1 | 0.00 | 1.00 | -1.32 | -0.85 | -0.39 | 0.55 | 4.29 | ▇▅▂▁▁ |
| Mnf.Flow | 0 | 1 | 0.00 | 1.00 | -1.04 | -1.04 | 0.38 | 0.97 | 1.71 | ▇▁▁▇▂ |
| Carb.Pressure1 | 0 | 1 | 0.00 | 1.00 | -3.60 | -0.76 | 0.14 | 0.60 | 3.75 | ▁▃▇▂▁ |
| Fill.Pressure | 0 | 1 | 0.00 | 1.00 | -4.19 | -0.60 | -0.48 | 0.66 | 3.93 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 0 | 1 | 0.00 | 1.00 | -1.06 | -1.00 | -0.08 | 0.63 | 3.67 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 0 | 1 | 0.00 | 1.00 | -1.27 | -1.27 | 0.47 | 0.84 | 2.35 | ▇▂▇▅▁ |
| Hyd.Pressure4 | 0 | 1 | 0.00 | 1.00 | -2.62 | -0.80 | -0.04 | 0.42 | 3.46 | ▂▆▇▂▁ |
| Temperature | 0 | 1 | 0.00 | 1.00 | -1.71 | -0.56 | -0.27 | 0.30 | 7.34 | ▇▃▁▁▁ |
| Usage.cont | 0 | 1 | 0.00 | 1.00 | -3.00 | -0.88 | 0.26 | 0.92 | 1.65 | ▁▃▅▃▇ |
| Carb.Flow | 0 | 1 | 0.00 | 1.00 | -2.29 | -1.22 | 0.52 | 0.67 | 2.46 | ▂▅▆▇▁ |
| MFR | 0 | 1 | 0.00 | 1.00 | -5.13 | 0.17 | 0.38 | 0.45 | 1.55 | ▁▁▁▂▇ |
| Pressure.Vacuum | 0 | 1 | 0.00 | 1.00 | -2.43 | -0.67 | -0.32 | 0.38 | 2.83 | ▂▇▅▃▁ |
| Oxygen.Filler | 0 | 1 | 0.00 | 1.00 | -0.98 | -0.55 | -0.29 | 0.30 | 7.83 | ▇▁▁▁▁ |
| Bowl.Setpoint | 0 | 1 | 0.00 | 1.00 | -2.57 | -0.61 | 0.70 | 0.70 | 2.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 0 | 1 | 0.00 | 1.00 | -1.77 | -0.79 | -0.79 | 1.17 | 2.16 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1 | 0.00 | 1.00 | -1.68 | -0.52 | -0.19 | 0.14 | 4.42 | ▅▇▁▁▁ |
| Alch.Rel | 0 | 1 | 0.00 | 1.00 | -3.20 | -0.71 | -0.67 | 0.66 | 3.41 | ▁▇▂▃▁ |
| Carb.Rel | 0 | 1 | 0.00 | 1.00 | -3.70 | -0.75 | -0.28 | 0.80 | 4.85 | ▁▇▆▂▁ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Brand.Code_B | 0 | 1 | 0.00 | 1.00 | -1.00 | -1.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| Brand.Code_C | 0 | 1 | 0.00 | 1.00 | -0.41 | -0.41 | -0.41 | -0.41 | 2.43 | ▇▁▁▁▂ |
| Brand.Code_D | 0 | 1 | 0.00 | 1.00 | -0.56 | -0.56 | -0.56 | -0.56 | 1.78 | ▇▁▁▁▂ |
6 Model building
Three categories of regression models are to be built in this section: Linear Regression Models, Non-linear Regression Models and Tree-based Models. The model with the best performance on the test data set will be selected as the final model.
The models to be built are as below:
- Linear Regression Models:
PLS,Ridge,LASSOandElastic Net - Non-linear Regression Models:
KNN,SVM-Linear,SVM-Radial,MARSandNeural Network - Tree-based Regression Models:
Random Forest,Gradient Boosting MachineandCubist
6.1 Linear Regression Models
6.1.1 PLS Regression
7 latent variables (ncomp = 7) are optimal.
The corresponding training-set RMSE and R2 are 0.1362656 and 0.3739715 respectively.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
Linear_PLS <- train(data_train_X, data_train_Y,
method = 'pls',
tuneLength = 20,
trControl = ctrl)
Linear_PLS
## Partial Least Squares
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1497005 0.2470540 0.1176600
## 2 0.1430215 0.3139339 0.1116965
## 3 0.1413576 0.3297154 0.1108805
## 4 0.1396517 0.3458175 0.1093216
## 5 0.1390031 0.3516492 0.1085059
## 6 0.1384918 0.3566973 0.1080004
## 7 0.1384305 0.3573092 0.1081537
## 8 0.1384597 0.3570316 0.1080082
## 9 0.1385041 0.3566531 0.1080056
## 10 0.1385358 0.3563692 0.1080224
## 11 0.1385680 0.3560643 0.1080587
## 12 0.1385836 0.3559539 0.1080834
## 13 0.1385914 0.3558839 0.1080780
## 14 0.1385636 0.3561045 0.1080470
## 15 0.1385706 0.3560451 0.1080609
## 16 0.1385782 0.3559804 0.1080730
## 17 0.1385796 0.3559731 0.1080733
## 18 0.1385952 0.3558288 0.1080827
## 19 0.1386021 0.3557690 0.1080826
## 20 0.1386004 0.3557876 0.1080814
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
Linear_PLS_pred <- predict(Linear_PLS, data_train_X)
Linear_PLS_metrics <- postResample(Linear_PLS_pred, data_train_Y)
Linear_PLS_metrics
## RMSE Rsquared MAE
## 0.1362656 0.3739715 0.1064367
6.1.2 Ridge Regression
lambda = 0.03157895 is optimal.
The corresponding test-set RMSE and R2 are 0.1299868 and 0.4415918 respectively.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
ridgeGrid <- data.frame(.lambda = seq(0, .2, length = 20))
Linear_Ridge <- train(data_train_X, data_train_Y,
method = 'ridge',
tuneGrid = ridgeGrid,
trControl = ctrl)
Linear_Ridge
## Ridge Regression
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 0.1386059 0.3557400 0.1080834
## 0.01052632 0.1385244 0.3564372 0.1080449
## 0.02105263 0.1384937 0.3566985 0.1080301
## 0.03157895 0.1384906 0.3567237 0.1080267
## 0.04210526 0.1385055 0.3565978 0.1080296
## 0.05263158 0.1385331 0.3563667 0.1080395
## 0.06315789 0.1385701 0.3560587 0.1080534
## 0.07368421 0.1386146 0.3556927 0.1080730
## 0.08421053 0.1386650 0.3552820 0.1081018
## 0.09473684 0.1387202 0.3548367 0.1081358
## 0.10526316 0.1387795 0.3543642 0.1081742
## 0.11578947 0.1388424 0.3538704 0.1082172
## 0.12631579 0.1389082 0.3533599 0.1082615
## 0.13684211 0.1389767 0.3528364 0.1083094
## 0.14736842 0.1390475 0.3523031 0.1083598
## 0.15789474 0.1391204 0.3517622 0.1084124
## 0.16842105 0.1391953 0.3512160 0.1084677
## 0.17894737 0.1392719 0.3506661 0.1085256
## 0.18947368 0.1393501 0.3501139 0.1085829
## 0.20000000 0.1394298 0.3495607 0.1086415
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03157895.
Linear_Ridge_Pred <- predict(Linear_Ridge, newdata = data_test_X)
Linear_Ridge_metrics <- postResample(pred = Linear_Ridge_Pred, obs = data_test_Y)
Linear_Ridge_metrics
## RMSE Rsquared MAE
## 0.1299868 0.4415918 0.1021300
6.1.3 LASSO
The optimal fraction is 0.1.
The corresponding test-set RMSE and R2 are 0.1561285 and 0.2961838 respectively.
set.seed(0)
lassoGrid <- data.frame(.fraction = seq(0.01, .1, length = 20))
Linear_LASSO <- train(data_train_X, data_train_Y,
method = 'lasso',
tuneGrid = lassoGrid,
trControl = ctrl)
Linear_LASSO
## The lasso
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.01000000 0.1702219 0.1939752 0.1358928
## 0.01473684 0.1693460 0.1939752 0.1350627
## 0.01947368 0.1684926 0.1939752 0.1342976
## 0.02421053 0.1676620 0.1939752 0.1335504
## 0.02894737 0.1668545 0.1939752 0.1328056
## 0.03368421 0.1660705 0.1939752 0.1321218
## 0.03842105 0.1653103 0.1939752 0.1314585
## 0.04315789 0.1645743 0.1939752 0.1308063
## 0.04789474 0.1638627 0.1939752 0.1301621
## 0.05263158 0.1631759 0.1939752 0.1295284
## 0.05736842 0.1625142 0.1939752 0.1289066
## 0.06210526 0.1619037 0.1954704 0.1283338
## 0.06684211 0.1613301 0.1989758 0.1277900
## 0.07157895 0.1607555 0.2047889 0.1272578
## 0.07631579 0.1601689 0.2114635 0.1267473
## 0.08105263 0.1595945 0.2174196 0.1262680
## 0.08578947 0.1590325 0.2227269 0.1257959
## 0.09052632 0.1584829 0.2274517 0.1253301
## 0.09526316 0.1579511 0.2316237 0.1248748
## 0.10000000 0.1574345 0.2354022 0.1244261
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
Linear_LASSO_Pred <- predict(Linear_LASSO, newdata = data_test_X)
Linear_LASSO_metrics <- postResample(pred = Linear_LASSO_Pred, obs = data_test_Y)
Linear_LASSO_metrics
## RMSE Rsquared MAE
## 0.1561285 0.2961838 0.1274395
6.1.4 Elastic Net
The optimal values are fraction = 0.1 and lambda = 0.2.
The corresponding test-set RMSE and R2 are 0.1589297 and 0.2697740 respectively.
set.seed(0)
enetGrid <- data.frame(.lambda = seq(0, .2, length = 20),
.fraction = seq(0.01, .1, length = 20))
Linear_eNet <- train(data_train_X, data_train_Y,
method = 'enet',
tuneGrid = enetGrid,
trControl = ctrl)
Linear_eNet
## Elasticnet
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.01000000 0.1702219 0.1939752 0.1358928
## 0.01052632 0.01473684 0.1694478 0.1939752 0.1351553
## 0.02105263 0.01947368 0.1687105 0.1939752 0.1344917
## 0.03157895 0.02421053 0.1680056 0.1939752 0.1338616
## 0.04210526 0.02894737 0.1673297 0.1939752 0.1332466
## 0.05263158 0.03368421 0.1666792 0.1939752 0.1326442
## 0.06315789 0.03842105 0.1660529 0.1939752 0.1321087
## 0.07368421 0.04315789 0.1654493 0.1939752 0.1315835
## 0.08421053 0.04789474 0.1648661 0.1939752 0.1310710
## 0.09473684 0.05263158 0.1643026 0.1939752 0.1305686
## 0.10526316 0.05736842 0.1637586 0.1939752 0.1300747
## 0.11578947 0.06210526 0.1632339 0.1939752 0.1295900
## 0.12631579 0.06684211 0.1627265 0.1939752 0.1291161
## 0.13684211 0.07157895 0.1622458 0.1942561 0.1286614
## 0.14736842 0.07631579 0.1618001 0.1953714 0.1282402
## 0.15789474 0.08105263 0.1613795 0.1982185 0.1278414
## 0.16842105 0.08578947 0.1609593 0.2025491 0.1274507
## 0.17894737 0.09052632 0.1605323 0.2074739 0.1270651
## 0.18947368 0.09526316 0.1601179 0.2122214 0.1266996
## 0.20000000 0.10000000 0.1597176 0.2167320 0.1263679
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.2.
Linear_eNet_Pred <- predict(Linear_eNet, newdata = data_test_X)
Linear_eNet_metrics <- postResample(pred = Linear_eNet_Pred, obs = data_test_Y)
Linear_eNet_metrics
## RMSE Rsquared MAE
## 0.1589297 0.2697740 0.1299668
6.2 Non-Linear Regression Models
6.2.1 KNN
The optimal k is 7.
The corresponding training-set RMSE and R2 are 0.10585060 and 0.62857413 respectively.
set.seed(0)
NonLinear_KNN <- train(data_train_X, data_train_Y,
method = 'knn',
tuneLength = 10,
trControl = ctrl)
NonLinear_KNN
## k-Nearest Neighbors
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1257029 0.4775757 0.09351756
## 7 0.1237475 0.4906292 0.09276375
## 9 0.1242006 0.4868828 0.09366748
## 11 0.1258387 0.4745378 0.09549822
## 13 0.1263061 0.4712000 0.09587242
## 15 0.1274855 0.4620434 0.09716663
## 17 0.1284044 0.4544409 0.09826715
## 19 0.1287749 0.4513034 0.09857713
## 21 0.1292793 0.4471276 0.09919209
## 23 0.1298352 0.4422596 0.09962648
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
NonLinear_KNN_pred <- predict(NonLinear_KNN, data_train_X)
NonLinear_KNN_metrics <- postResample(NonLinear_KNN_pred, data_train_Y)
NonLinear_KNN_metrics
## RMSE Rsquared MAE
## 0.10585060 0.62857413 0.07894874
6.2.2 SVM-Linear
The optimal parameters are epsilon = 0.1 and cost C = 1.
The corresponding training-set RMSE and R2 are 0.1381481 and 0.3615830 respectively.
set.seed(0)
NonLinear_SVMLinear <- train(data_train_X, data_train_Y,
method = 'svmLinear',
tuneLength = 15,
trControl = ctrl)
NonLinear_SVMLinear
## Support Vector Machines with Linear Kernel
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1405161 0.3452223 0.1072494
##
## Tuning parameter 'C' was held constant at a value of 1
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 1831
##
## Objective Function Value : -1053.426
## Training error : 0.643132
NonLinear_SVMLinear_pred <- predict(NonLinear_SVMLinear, data_train_X)
NonLinear_SVMLinear_metrics <- postResample(NonLinear_SVMLinear_pred, data_train_Y)
NonLinear_SVMLinear_metrics
## RMSE Rsquared MAE
## 0.1381481 0.3615830 0.1045695
6.2.3 SVM-Radial
The optimal parameters are sigma = 0.0242724 and C = 4.
The corresponding training-set RMSE and R2 are 0.08011998 and 0.79263724 respectively.
set.seed(0)
NonLinear_SVMRadial <- train(data_train_X, data_train_Y,
method = 'svmRadial',
tuneLength = 15,
trControl = ctrl)
NonLinear_SVMRadial
## Support Vector Machines with Radial Basis Function Kernel
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1286431 0.4526820 0.09577483
## 0.50 0.1256923 0.4758057 0.09278004
## 1.00 0.1231104 0.4952829 0.09035109
## 2.00 0.1210941 0.5106732 0.08867772
## 4.00 0.1204826 0.5158644 0.08851988
## 8.00 0.1212283 0.5141755 0.08924725
## 16.00 0.1224728 0.5116971 0.09033769
## 32.00 0.1258334 0.4986777 0.09296903
## 64.00 0.1326503 0.4687005 0.09806454
## 128.00 0.1389973 0.4449388 0.10296902
## 256.00 0.1452495 0.4218464 0.10818173
## 512.00 0.1510565 0.4016687 0.11316640
## 1024.00 0.1519305 0.3984248 0.11383537
## 2048.00 0.1519305 0.3984248 0.11383537
## 4096.00 0.1519305 0.3984248 0.11383537
##
## Tuning parameter 'sigma' was held constant at a value of 0.0242724
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0242724 and C = 4.
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0242723997688406
##
## Number of Support Vectors : 1748
##
## Objective Function Value : -2289.491
## Training error : 0.216318
NonLinear_SVMRadial_pred <- predict(NonLinear_SVMRadial, data_train_X)
NonLinear_SVMRadial_metrics <- postResample(NonLinear_SVMRadial_pred, data_train_Y)
NonLinear_SVMRadial_metrics
## RMSE Rsquared MAE
## 0.08011998 0.79263724 0.05028598
6.2.4 MARS
The optimal values are nprune = 23 and degree = 2.
The corresponding test-set RMSE and R2 are 0.12396741 and 0.49036903 respectively.
set.seed(0)
NonLinear_MARS <- train(data_train_X, data_train_Y,
method ='earth',
tuneGrid = expand.grid(.degree = 1:2,
.nprune = 2:38),
trControl = ctrl)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
NonLinear_MARS
## Multivariate Adaptive Regression Spline
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1527874 0.2164850 0.11922540
## 1 3 0.1457986 0.2863438 0.11355089
## 1 4 0.1452585 0.2918423 0.11308342
## 1 5 0.1441796 0.3026028 0.11197668
## 1 6 0.1415169 0.3270726 0.10987591
## 1 7 0.1404233 0.3375773 0.10888650
## 1 8 0.1393220 0.3479701 0.10820117
## 1 9 0.1376909 0.3630413 0.10675904
## 1 10 0.1361410 0.3783478 0.10524108
## 1 11 0.1359734 0.3793368 0.10491628
## 1 12 0.1356803 0.3821798 0.10448616
## 1 13 0.1371386 0.3722615 0.10517051
## 1 14 0.1371648 0.3721012 0.10514100
## 1 15 0.1376772 0.3688911 0.10526938
## 1 16 0.1377746 0.3686955 0.10511907
## 1 17 0.1377336 0.3690273 0.10493933
## 1 18 0.1376191 0.3702063 0.10482253
## 1 19 0.1376361 0.3699936 0.10482417
## 1 20 0.1377596 0.3690091 0.10470162
## 1 21 0.1376697 0.3698763 0.10467845
## 1 22 0.1372426 0.3737903 0.10428601
## 1 23 0.1373113 0.3733069 0.10431459
## 1 24 0.1371606 0.3744426 0.10429175
## 1 25 0.1369533 0.3761997 0.10418090
## 1 26 0.1366941 0.3786725 0.10387195
## 1 27 0.1368112 0.3779475 0.10391476
## 1 28 0.1367762 0.3786085 0.10390836
## 1 29 0.1366418 0.3798313 0.10380135
## 1 30 0.1364046 0.3818173 0.10374862
## 1 31 0.1366823 0.3794642 0.10376088
## 1 32 0.1367684 0.3788260 0.10381211
## 1 33 0.1371391 0.3760662 0.10401979
## 1 34 0.1371362 0.3761394 0.10397847
## 1 35 0.1373108 0.3745209 0.10399745
## 1 36 0.1374006 0.3739643 0.10405588
## 1 37 0.1374042 0.3738738 0.10411443
## 1 38 0.1374042 0.3738738 0.10411443
## 2 2 0.1527874 0.2164850 0.11922540
## 2 3 0.1461656 0.2830313 0.11377071
## 2 4 0.1448162 0.2964531 0.11218248
## 2 5 0.1432118 0.3120895 0.11125386
## 2 6 0.1413509 0.3291617 0.10987207
## 2 7 0.1399609 0.3424904 0.10835457
## 2 8 0.1392349 0.3520761 0.10718962
## 2 9 0.1356488 0.3813322 0.10373772
## 2 10 0.1362731 0.3780690 0.10382089
## 2 11 0.1365127 0.3769165 0.10333758
## 2 12 0.1364264 0.3778924 0.10319761
## 2 13 0.1351820 0.3883888 0.10240425
## 2 14 0.1355442 0.3855695 0.10251147
## 2 15 0.1346713 0.3928119 0.10169420
## 2 16 0.1337491 0.4010124 0.10125582
## 2 17 0.1337139 0.4018183 0.10135034
## 2 18 0.1330777 0.4078082 0.10074254
## 2 19 0.1330450 0.4084463 0.10053691
## 2 20 0.1327534 0.4110547 0.10032196
## 2 21 0.1325528 0.4127307 0.10017536
## 2 22 0.1321892 0.4157352 0.09981311
## 2 23 0.1318251 0.4188860 0.09946667
## 2 24 0.1318599 0.4185769 0.09949694
## 2 25 0.1320799 0.4167887 0.09960169
## 2 26 0.1321709 0.4160313 0.09959809
## 2 27 0.1320612 0.4169283 0.09951555
## 2 28 0.1320617 0.4168964 0.09952595
## 2 29 0.1320048 0.4173713 0.09945129
## 2 30 0.1320310 0.4171650 0.09951915
## 2 31 0.1320310 0.4171650 0.09951915
## 2 32 0.1320310 0.4171650 0.09951915
## 2 33 0.1320310 0.4171650 0.09951915
## 2 34 0.1320310 0.4171650 0.09951915
## 2 35 0.1320310 0.4171650 0.09951915
## 2 36 0.1320310 0.4171650 0.09951915
## 2 37 0.1320310 0.4171650 0.09951915
## 2 38 0.1320310 0.4171650 0.09951915
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 23 and degree = 2.
## Selected 23 of 29 terms, and 11 of 28 predictors (nprune=23)
## Termination condition: RSq changed by less than 0.001 at 29 terms
## Importance: Mnf.Flow, Brand.Code_C, Alch.Rel, Bowl.Setpoint, ...
## Number of terms at each degree of interaction: 1 5 17
## GCV 0.01611237 RSS 31.31483 GRSq 0.4573021 RSq 0.4859905
NonLinear_MARS_Pred <- predict(NonLinear_MARS, newdata = data_test_X)
NonLinear_MARS_metrics <- postResample(pred = NonLinear_MARS_Pred, obs = data_test_Y)
NonLinear_MARS_metrics
## RMSE Rsquared MAE
## 0.12396741 0.49036903 0.09564496
6.2.5 Neural Network
The final neural network model has size = 5 and decay = 0.01, with test-set RMSE and R2 of 0.11423783 and 0.56938536 respectively.
set.seed(0)
NonLinear_NNet <- train(data_train_X, data_train_Y,
method ='avNNet',
tuneGrid = expand.grid(.decay = seq(0.01,0.1,0.02),
.size = c(1:5),
.bag = FALSE),
trControl = trainControl(method = "cv"),
trace = FALSE,
linout =TRUE#,
#MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
#maxit = 500
)
NonLinear_NNet
## Model Averaged Neural Network
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.01 1 0.1390464 0.3530330 0.10718967
## 0.01 2 0.1434126 0.3389881 0.10782088
## 0.01 3 0.1526538 0.3942120 0.10075708
## 0.01 4 0.1257428 0.4693378 0.09528213
## 0.01 5 0.1233552 0.4889622 0.09328839
## 0.03 1 0.1386663 0.3554992 0.10775591
## 0.03 2 0.1388569 0.3613455 0.10756416
## 0.03 3 0.1315032 0.4201018 0.10026793
## 0.03 4 0.1258351 0.4704857 0.09536096
## 0.03 5 0.1247329 0.4793170 0.09451236
## 0.05 1 0.1384113 0.3582842 0.10760727
## 0.05 2 0.1418169 0.3406693 0.10996645
## 0.05 3 0.1307451 0.4301629 0.09983160
## 0.05 4 0.1269153 0.4592491 0.09681098
## 0.05 5 0.1243520 0.4819411 0.09394431
## 0.07 1 0.1387791 0.3543691 0.10793027
## 0.07 2 0.1433552 0.3242280 0.11137415
## 0.07 3 0.1307574 0.4302750 0.10031547
## 0.07 4 0.1275863 0.4536516 0.09767131
## 0.07 5 0.1249580 0.4762182 0.09519004
## 0.09 1 0.1384934 0.3572619 0.10786324
## 0.09 2 0.1388061 0.3677063 0.10781687
## 0.09 3 0.1297552 0.4408124 0.09960522
## 0.09 4 0.1263112 0.4645131 0.09599997
## 0.09 5 0.1251908 0.4737351 0.09525916
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag
## = FALSE.
NonLinear_NNet_Pred <- predict(NonLinear_NNet, newdata = data_test_X)
NonLinear_NNet_metrics <- postResample(pred = NonLinear_NNet_Pred, obs = data_test_Y)
NonLinear_NNet_metrics
## RMSE Rsquared MAE
## 0.11423783 0.56938536 0.08687277
6.3 Tree-Based Regression Models
6.3.1 Random Forest
The optimal mtry = 15.
The corresponding test-set RMSE and R2 are 0.09784328 and 0.69226170 respectively.
set.seed(0)
TreeBased_RF <- train(x = data_train_X,
y = data_train_Y,
method = "rf",
trControl = ctrl)
TreeBased_RF
## Random Forest
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1165576 0.5859532 0.08864558
## 15 0.1046622 0.6441878 0.07596282
## 28 0.1054225 0.6312982 0.07518499
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
TreeBased_RF_Pred <- predict(TreeBased_RF, newdata = data_test_X)
TreeBased_RF_metrics <- postResample(pred = TreeBased_RF_Pred, obs = data_test_Y)
TreeBased_RF_metrics
## RMSE Rsquared MAE
## 0.09784328 0.69226170 0.07327428
6.3.2 Gradient Boosting Machine
The optimal parameters are n.trees = 900, interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
The corresponding test-set RMSE and R2 are 0.1104675 and 0.5972602 respectively.
set.seed(0)
TreeBased_GBM <- train(x = data_train_X,
y = data_train_Y,
method = "gbm",
tuneGrid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = c(5,10)),
tuneLength = 10,
trControl = ctrl,
verbose = FALSE)
TreeBased_GBM
## Stochastic Gradient Boosting
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE
## 0.01 1 5 100 0.1535056
## 0.01 1 5 150 0.1490350
## 0.01 1 5 200 0.1458946
## 0.01 1 5 250 0.1436576
## 0.01 1 5 300 0.1419323
## 0.01 1 5 350 0.1406554
## 0.01 1 5 400 0.1396556
## 0.01 1 5 450 0.1389351
## 0.01 1 5 500 0.1382906
## 0.01 1 5 550 0.1378580
## 0.01 1 5 600 0.1374626
## 0.01 1 5 650 0.1371200
## 0.01 1 5 700 0.1367336
## 0.01 1 5 750 0.1364098
## 0.01 1 5 800 0.1361145
## 0.01 1 5 850 0.1358792
## 0.01 1 5 900 0.1356063
## 0.01 1 5 950 0.1353368
## 0.01 1 5 1000 0.1351189
## 0.01 1 10 100 0.1536465
## 0.01 1 10 150 0.1491078
## 0.01 1 10 200 0.1460110
## 0.01 1 10 250 0.1437385
## 0.01 1 10 300 0.1420114
## 0.01 1 10 350 0.1406584
## 0.01 1 10 400 0.1397182
## 0.01 1 10 450 0.1389405
## 0.01 1 10 500 0.1383112
## 0.01 1 10 550 0.1377765
## 0.01 1 10 600 0.1373695
## 0.01 1 10 650 0.1370119
## 0.01 1 10 700 0.1366433
## 0.01 1 10 750 0.1362744
## 0.01 1 10 800 0.1360030
## 0.01 1 10 850 0.1357687
## 0.01 1 10 900 0.1355482
## 0.01 1 10 950 0.1352849
## 0.01 1 10 1000 0.1350739
## 0.01 3 5 100 0.1443981
## 0.01 3 5 150 0.1387395
## 0.01 3 5 200 0.1352581
## 0.01 3 5 250 0.1330947
## 0.01 3 5 300 0.1316020
## 0.01 3 5 350 0.1304906
## 0.01 3 5 400 0.1295551
## 0.01 3 5 450 0.1287383
## 0.01 3 5 500 0.1280514
## 0.01 3 5 550 0.1274304
## 0.01 3 5 600 0.1267289
## 0.01 3 5 650 0.1262282
## 0.01 3 5 700 0.1258221
## 0.01 3 5 750 0.1254674
## 0.01 3 5 800 0.1251154
## 0.01 3 5 850 0.1247970
## 0.01 3 5 900 0.1245305
## 0.01 3 5 950 0.1243256
## 0.01 3 5 1000 0.1240812
## 0.01 3 10 100 0.1445042
## 0.01 3 10 150 0.1386595
## 0.01 3 10 200 0.1351232
## 0.01 3 10 250 0.1329691
## 0.01 3 10 300 0.1313775
## 0.01 3 10 350 0.1301674
## 0.01 3 10 400 0.1292257
## 0.01 3 10 450 0.1284388
## 0.01 3 10 500 0.1277175
## 0.01 3 10 550 0.1271278
## 0.01 3 10 600 0.1265867
## 0.01 3 10 650 0.1260797
## 0.01 3 10 700 0.1256559
## 0.01 3 10 750 0.1252243
## 0.01 3 10 800 0.1248902
## 0.01 3 10 850 0.1245717
## 0.01 3 10 900 0.1242475
## 0.01 3 10 950 0.1239413
## 0.01 3 10 1000 0.1237621
## 0.01 5 5 100 0.1407584
## 0.01 5 5 150 0.1347082
## 0.01 5 5 200 0.1310276
## 0.01 5 5 250 0.1286710
## 0.01 5 5 300 0.1269540
## 0.01 5 5 350 0.1257140
## 0.01 5 5 400 0.1246789
## 0.01 5 5 450 0.1239186
## 0.01 5 5 500 0.1233149
## 0.01 5 5 550 0.1226674
## 0.01 5 5 600 0.1220399
## 0.01 5 5 650 0.1216075
## 0.01 5 5 700 0.1212381
## 0.01 5 5 750 0.1208889
## 0.01 5 5 800 0.1205311
## 0.01 5 5 850 0.1201081
## 0.01 5 5 900 0.1198700
## 0.01 5 5 950 0.1196036
## 0.01 5 5 1000 0.1193918
## 0.01 5 10 100 0.1405693
## 0.01 5 10 150 0.1344117
## 0.01 5 10 200 0.1307994
## 0.01 5 10 250 0.1282400
## 0.01 5 10 300 0.1265838
## 0.01 5 10 350 0.1253331
## 0.01 5 10 400 0.1242838
## 0.01 5 10 450 0.1234494
## 0.01 5 10 500 0.1226329
## 0.01 5 10 550 0.1219982
## 0.01 5 10 600 0.1214775
## 0.01 5 10 650 0.1209410
## 0.01 5 10 700 0.1206051
## 0.01 5 10 750 0.1201725
## 0.01 5 10 800 0.1198850
## 0.01 5 10 850 0.1195760
## 0.01 5 10 900 0.1192327
## 0.01 5 10 950 0.1190077
## 0.01 5 10 1000 0.1188001
## 0.01 7 5 100 0.1384275
## 0.01 7 5 150 0.1320876
## 0.01 7 5 200 0.1282591
## 0.01 7 5 250 0.1257032
## 0.01 7 5 300 0.1239902
## 0.01 7 5 350 0.1225967
## 0.01 7 5 400 0.1216320
## 0.01 7 5 450 0.1206244
## 0.01 7 5 500 0.1199510
## 0.01 7 5 550 0.1193445
## 0.01 7 5 600 0.1187967
## 0.01 7 5 650 0.1184142
## 0.01 7 5 700 0.1180329
## 0.01 7 5 750 0.1176940
## 0.01 7 5 800 0.1173624
## 0.01 7 5 850 0.1170728
## 0.01 7 5 900 0.1167934
## 0.01 7 5 950 0.1165117
## 0.01 7 5 1000 0.1162684
## 0.01 7 10 100 0.1381935
## 0.01 7 10 150 0.1316462
## 0.01 7 10 200 0.1276732
## 0.01 7 10 250 0.1251146
## 0.01 7 10 300 0.1232905
## 0.01 7 10 350 0.1220781
## 0.01 7 10 400 0.1210549
## 0.01 7 10 450 0.1202724
## 0.01 7 10 500 0.1195638
## 0.01 7 10 550 0.1189483
## 0.01 7 10 600 0.1183244
## 0.01 7 10 650 0.1179829
## 0.01 7 10 700 0.1175881
## 0.01 7 10 750 0.1172371
## 0.01 7 10 800 0.1169624
## 0.01 7 10 850 0.1166893
## 0.01 7 10 900 0.1163924
## 0.01 7 10 950 0.1161454
## 0.01 7 10 1000 0.1159055
## 0.10 1 5 100 0.1353870
## 0.10 1 5 150 0.1340044
## 0.10 1 5 200 0.1331607
## 0.10 1 5 250 0.1325162
## 0.10 1 5 300 0.1322091
## 0.10 1 5 350 0.1322188
## 0.10 1 5 400 0.1319665
## 0.10 1 5 450 0.1318188
## 0.10 1 5 500 0.1319638
## 0.10 1 5 550 0.1320879
## 0.10 1 5 600 0.1321047
## 0.10 1 5 650 0.1319325
## 0.10 1 5 700 0.1323059
## 0.10 1 5 750 0.1323037
## 0.10 1 5 800 0.1322896
## 0.10 1 5 850 0.1326825
## 0.10 1 5 900 0.1326673
## 0.10 1 5 950 0.1326552
## 0.10 1 5 1000 0.1326691
## 0.10 1 10 100 0.1353808
## 0.10 1 10 150 0.1335157
## 0.10 1 10 200 0.1326322
## 0.10 1 10 250 0.1321384
## 0.10 1 10 300 0.1319427
## 0.10 1 10 350 0.1319482
## 0.10 1 10 400 0.1320047
## 0.10 1 10 450 0.1317027
## 0.10 1 10 500 0.1320438
## 0.10 1 10 550 0.1319677
## 0.10 1 10 600 0.1317611
## 0.10 1 10 650 0.1320805
## 0.10 1 10 700 0.1319790
## 0.10 1 10 750 0.1319191
## 0.10 1 10 800 0.1318749
## 0.10 1 10 850 0.1322117
## 0.10 1 10 900 0.1322141
## 0.10 1 10 950 0.1323673
## 0.10 1 10 1000 0.1324594
## 0.10 3 5 100 0.1247993
## 0.10 3 5 150 0.1239652
## 0.10 3 5 200 0.1232386
## 0.10 3 5 250 0.1225885
## 0.10 3 5 300 0.1222911
## 0.10 3 5 350 0.1220773
## 0.10 3 5 400 0.1220112
## 0.10 3 5 450 0.1215983
## 0.10 3 5 500 0.1211350
## 0.10 3 5 550 0.1213181
## 0.10 3 5 600 0.1209272
## 0.10 3 5 650 0.1206428
## 0.10 3 5 700 0.1205694
## 0.10 3 5 750 0.1204253
## 0.10 3 5 800 0.1203508
## 0.10 3 5 850 0.1201017
## 0.10 3 5 900 0.1202024
## 0.10 3 5 950 0.1203198
## 0.10 3 5 1000 0.1201014
## 0.10 3 10 100 0.1246780
## 0.10 3 10 150 0.1234524
## 0.10 3 10 200 0.1226146
## 0.10 3 10 250 0.1216866
## 0.10 3 10 300 0.1209524
## 0.10 3 10 350 0.1208231
## 0.10 3 10 400 0.1208755
## 0.10 3 10 450 0.1206491
## 0.10 3 10 500 0.1206228
## 0.10 3 10 550 0.1204732
## 0.10 3 10 600 0.1202193
## 0.10 3 10 650 0.1201130
## 0.10 3 10 700 0.1202734
## 0.10 3 10 750 0.1203011
## 0.10 3 10 800 0.1200271
## 0.10 3 10 850 0.1201333
## 0.10 3 10 900 0.1202030
## 0.10 3 10 950 0.1201546
## 0.10 3 10 1000 0.1202208
## 0.10 5 5 100 0.1212583
## 0.10 5 5 150 0.1197410
## 0.10 5 5 200 0.1192567
## 0.10 5 5 250 0.1183344
## 0.10 5 5 300 0.1177544
## 0.10 5 5 350 0.1172734
## 0.10 5 5 400 0.1170780
## 0.10 5 5 450 0.1170512
## 0.10 5 5 500 0.1170092
## 0.10 5 5 550 0.1169448
## 0.10 5 5 600 0.1168782
## 0.10 5 5 650 0.1170498
## 0.10 5 5 700 0.1169686
## 0.10 5 5 750 0.1169860
## 0.10 5 5 800 0.1168449
## 0.10 5 5 850 0.1166501
## 0.10 5 5 900 0.1166432
## 0.10 5 5 950 0.1166941
## 0.10 5 5 1000 0.1166490
## 0.10 5 10 100 0.1203285
## 0.10 5 10 150 0.1186437
## 0.10 5 10 200 0.1178004
## 0.10 5 10 250 0.1173107
## 0.10 5 10 300 0.1169347
## 0.10 5 10 350 0.1164742
## 0.10 5 10 400 0.1160615
## 0.10 5 10 450 0.1158243
## 0.10 5 10 500 0.1154760
## 0.10 5 10 550 0.1153967
## 0.10 5 10 600 0.1152440
## 0.10 5 10 650 0.1151973
## 0.10 5 10 700 0.1150863
## 0.10 5 10 750 0.1151455
## 0.10 5 10 800 0.1148742
## 0.10 5 10 850 0.1149837
## 0.10 5 10 900 0.1146944
## 0.10 5 10 950 0.1148337
## 0.10 5 10 1000 0.1148119
## 0.10 7 5 100 0.1184026
## 0.10 7 5 150 0.1179315
## 0.10 7 5 200 0.1180765
## 0.10 7 5 250 0.1179658
## 0.10 7 5 300 0.1175782
## 0.10 7 5 350 0.1170519
## 0.10 7 5 400 0.1168684
## 0.10 7 5 450 0.1165944
## 0.10 7 5 500 0.1164862
## 0.10 7 5 550 0.1165281
## 0.10 7 5 600 0.1164659
## 0.10 7 5 650 0.1164430
## 0.10 7 5 700 0.1162906
## 0.10 7 5 750 0.1163773
## 0.10 7 5 800 0.1164120
## 0.10 7 5 850 0.1164677
## 0.10 7 5 900 0.1165455
## 0.10 7 5 950 0.1165966
## 0.10 7 5 1000 0.1165314
## 0.10 7 10 100 0.1175349
## 0.10 7 10 150 0.1170537
## 0.10 7 10 200 0.1166296
## 0.10 7 10 250 0.1164940
## 0.10 7 10 300 0.1162592
## 0.10 7 10 350 0.1162896
## 0.10 7 10 400 0.1160903
## 0.10 7 10 450 0.1158844
## 0.10 7 10 500 0.1161168
## 0.10 7 10 550 0.1159057
## 0.10 7 10 600 0.1157356
## 0.10 7 10 650 0.1155581
## 0.10 7 10 700 0.1154481
## 0.10 7 10 750 0.1153436
## 0.10 7 10 800 0.1156930
## 0.10 7 10 850 0.1156195
## 0.10 7 10 900 0.1155034
## 0.10 7 10 950 0.1156759
## 0.10 7 10 1000 0.1157331
## Rsquared MAE
## 0.2923043 0.12104582
## 0.3175356 0.11713260
## 0.3345305 0.11446047
## 0.3457517 0.11262829
## 0.3539033 0.11130744
## 0.3601665 0.11020835
## 0.3651888 0.10940607
## 0.3693003 0.10887048
## 0.3730780 0.10838540
## 0.3758276 0.10801547
## 0.3780504 0.10770114
## 0.3807034 0.10741898
## 0.3834791 0.10710048
## 0.3859153 0.10683060
## 0.3881325 0.10657080
## 0.3898195 0.10634838
## 0.3920246 0.10610349
## 0.3939641 0.10585506
## 0.3953937 0.10564788
## 0.2901715 0.12113708
## 0.3173627 0.11721307
## 0.3324042 0.11449854
## 0.3442850 0.11266653
## 0.3530893 0.11133265
## 0.3596855 0.11022095
## 0.3647388 0.10947421
## 0.3687133 0.10884475
## 0.3724134 0.10833251
## 0.3762077 0.10791443
## 0.3788602 0.10762313
## 0.3814104 0.10728577
## 0.3841774 0.10699975
## 0.3866392 0.10666650
## 0.3886066 0.10641055
## 0.3902346 0.10622005
## 0.3918679 0.10600871
## 0.3938946 0.10576644
## 0.3955024 0.10553620
## 0.3895942 0.11361467
## 0.4085464 0.10883339
## 0.4221029 0.10593687
## 0.4322963 0.10407672
## 0.4401262 0.10275747
## 0.4464604 0.10176121
## 0.4524626 0.10089176
## 0.4575277 0.10012256
## 0.4619026 0.09946874
## 0.4662053 0.09887681
## 0.4714428 0.09827623
## 0.4750793 0.09783409
## 0.4776138 0.09744715
## 0.4799069 0.09709940
## 0.4823535 0.09674901
## 0.4844402 0.09642783
## 0.4861647 0.09615482
## 0.4873754 0.09593639
## 0.4890149 0.09567516
## 0.3877509 0.11364208
## 0.4095044 0.10877600
## 0.4228489 0.10593533
## 0.4328778 0.10405052
## 0.4421676 0.10273885
## 0.4491202 0.10165845
## 0.4548781 0.10077391
## 0.4600134 0.10005415
## 0.4650236 0.09932900
## 0.4689388 0.09871916
## 0.4725248 0.09816326
## 0.4759486 0.09764840
## 0.4788238 0.09718342
## 0.4817735 0.09675055
## 0.4839559 0.09636641
## 0.4862928 0.09609419
## 0.4885961 0.09572951
## 0.4907546 0.09543445
## 0.4919536 0.09523381
## 0.4296328 0.11056944
## 0.4465845 0.10542151
## 0.4595223 0.10223897
## 0.4698218 0.10009360
## 0.4794847 0.09860618
## 0.4862967 0.09744553
## 0.4927006 0.09649515
## 0.4973343 0.09570717
## 0.5007846 0.09510394
## 0.5049475 0.09443334
## 0.5090950 0.09384399
## 0.5117210 0.09335976
## 0.5139853 0.09298451
## 0.5162901 0.09264970
## 0.5187712 0.09233416
## 0.5216779 0.09195232
## 0.5232556 0.09171237
## 0.5249086 0.09144184
## 0.5262414 0.09120052
## 0.4305419 0.11040402
## 0.4496975 0.10519122
## 0.4622780 0.10210678
## 0.4745939 0.09982927
## 0.4829701 0.09830628
## 0.4898305 0.09716037
## 0.4960863 0.09617146
## 0.5011013 0.09532461
## 0.5062142 0.09453153
## 0.5101728 0.09390248
## 0.5133029 0.09335901
## 0.5171049 0.09284134
## 0.5192795 0.09246480
## 0.5219163 0.09200919
## 0.5236911 0.09174965
## 0.5257575 0.09142983
## 0.5281709 0.09110828
## 0.5296959 0.09084697
## 0.5309683 0.09059011
## 0.4553393 0.10873165
## 0.4715015 0.10322464
## 0.4840593 0.09986379
## 0.4955013 0.09761230
## 0.5037097 0.09601242
## 0.5112731 0.09465708
## 0.5169049 0.09363113
## 0.5232551 0.09268454
## 0.5271739 0.09195952
## 0.5309892 0.09133709
## 0.5346564 0.09079435
## 0.5369050 0.09041760
## 0.5390687 0.08998010
## 0.5410202 0.08963275
## 0.5431144 0.08930669
## 0.5450597 0.08901503
## 0.5467731 0.08875337
## 0.5487468 0.08850885
## 0.5503899 0.08830302
## 0.4579059 0.10834217
## 0.4765811 0.10278031
## 0.4897284 0.09926566
## 0.5007878 0.09692264
## 0.5099010 0.09521690
## 0.5154119 0.09405683
## 0.5212298 0.09301995
## 0.5255340 0.09224600
## 0.5296746 0.09152930
## 0.5334226 0.09091125
## 0.5373467 0.09030151
## 0.5391987 0.08993093
## 0.5415483 0.08946552
## 0.5437602 0.08909500
## 0.5455212 0.08878540
## 0.5473915 0.08851441
## 0.5493444 0.08819497
## 0.5510494 0.08794372
## 0.5526001 0.08768313
## 0.3925706 0.10578607
## 0.4009384 0.10441821
## 0.4073512 0.10356795
## 0.4121818 0.10286105
## 0.4147433 0.10252442
## 0.4148174 0.10231741
## 0.4170615 0.10200739
## 0.4186291 0.10180281
## 0.4173253 0.10176632
## 0.4168509 0.10178692
## 0.4171331 0.10169856
## 0.4191245 0.10159296
## 0.4156994 0.10191771
## 0.4163146 0.10185401
## 0.4170539 0.10192938
## 0.4138965 0.10206631
## 0.4142706 0.10204338
## 0.4145678 0.10190521
## 0.4147332 0.10185310
## 0.3921931 0.10586555
## 0.4072644 0.10407100
## 0.4136823 0.10302749
## 0.4161934 0.10222333
## 0.4176470 0.10183855
## 0.4179291 0.10164997
## 0.4172452 0.10159452
## 0.4196887 0.10133641
## 0.4169207 0.10150627
## 0.4182346 0.10139667
## 0.4198290 0.10107966
## 0.4171574 0.10112351
## 0.4181224 0.10117512
## 0.4190221 0.10095109
## 0.4197240 0.10083218
## 0.4169570 0.10094134
## 0.4164783 0.10101379
## 0.4157212 0.10097037
## 0.4151814 0.10098202
## 0.4816502 0.09612469
## 0.4868661 0.09486796
## 0.4913976 0.09386725
## 0.4968902 0.09321608
## 0.4999181 0.09285739
## 0.5016669 0.09260927
## 0.5026058 0.09225028
## 0.5067216 0.09199024
## 0.5100292 0.09154079
## 0.5092293 0.09162251
## 0.5125149 0.09122717
## 0.5148562 0.09089508
## 0.5157139 0.09088356
## 0.5173058 0.09063209
## 0.5182242 0.09069230
## 0.5202329 0.09060502
## 0.5195152 0.09066987
## 0.5188649 0.09053892
## 0.5206071 0.09044505
## 0.4819876 0.09626836
## 0.4904744 0.09454040
## 0.4967831 0.09377631
## 0.5045420 0.09303392
## 0.5104180 0.09212121
## 0.5116055 0.09181519
## 0.5111501 0.09173337
## 0.5131878 0.09165600
## 0.5138878 0.09147914
## 0.5156730 0.09137932
## 0.5175690 0.09106740
## 0.5185697 0.09105451
## 0.5177694 0.09111288
## 0.5178604 0.09125541
## 0.5197956 0.09083979
## 0.5190885 0.09096936
## 0.5189439 0.09100255
## 0.5192169 0.09083851
## 0.5190606 0.09082384
## 0.5082410 0.09219655
## 0.5199920 0.09033539
## 0.5226496 0.08951928
## 0.5302999 0.08894231
## 0.5350123 0.08839156
## 0.5388060 0.08805741
## 0.5407253 0.08783215
## 0.5409549 0.08772513
## 0.5418992 0.08763445
## 0.5425453 0.08759442
## 0.5428591 0.08757097
## 0.5418499 0.08771346
## 0.5427772 0.08765661
## 0.5428787 0.08758450
## 0.5443327 0.08749978
## 0.5455025 0.08728405
## 0.5460617 0.08722536
## 0.5459019 0.08738157
## 0.5462553 0.08736333
## 0.5157180 0.09190604
## 0.5283063 0.09030123
## 0.5342711 0.08958764
## 0.5382390 0.08888796
## 0.5414564 0.08830597
## 0.5452703 0.08796681
## 0.5488859 0.08731427
## 0.5513184 0.08694026
## 0.5541820 0.08661303
## 0.5551141 0.08667164
## 0.5565919 0.08631202
## 0.5569887 0.08610650
## 0.5580502 0.08590139
## 0.5579847 0.08578173
## 0.5601827 0.08565956
## 0.5593102 0.08578147
## 0.5617171 0.08551281
## 0.5610253 0.08569260
## 0.5614420 0.08571081
## 0.5312223 0.08936245
## 0.5350188 0.08859739
## 0.5341844 0.08839159
## 0.5354758 0.08786617
## 0.5386357 0.08724049
## 0.5429063 0.08694732
## 0.5444806 0.08656257
## 0.5469085 0.08651211
## 0.5481268 0.08621271
## 0.5481705 0.08636112
## 0.5488321 0.08636786
## 0.5491554 0.08641019
## 0.5504592 0.08633135
## 0.5500163 0.08634585
## 0.5499226 0.08638559
## 0.5497237 0.08636256
## 0.5493203 0.08638006
## 0.5490384 0.08638911
## 0.5496591 0.08631969
## 0.5379357 0.08898425
## 0.5416412 0.08834651
## 0.5448512 0.08767391
## 0.5458605 0.08698501
## 0.5480120 0.08693774
## 0.5481531 0.08704799
## 0.5502196 0.08686922
## 0.5522713 0.08686386
## 0.5508473 0.08707606
## 0.5526465 0.08682283
## 0.5542019 0.08661854
## 0.5556558 0.08641946
## 0.5565818 0.08623262
## 0.5573549 0.08605023
## 0.5550600 0.08633598
## 0.5557930 0.08638667
## 0.5567567 0.08629577
## 0.5556182 0.08635852
## 0.5552863 0.08638720
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 900,
## interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
TreeBased_GBM_Pred <- predict(TreeBased_GBM, newdata = data_test_X)
TreeBased_GBM_metrics <- postResample(pred = TreeBased_GBM_Pred, obs = data_test_Y)
TreeBased_GBM_metrics
## RMSE Rsquared MAE
## 0.1104675 0.5972602 0.0845282
6.3.3 Cubist
The optimal parameters are committees = 20 and neighbors = 5.
The corresponding test-set RMSE and R2 are 0.09987318 and 0.67114775 respectively.
set.seed(0)
TreeBased_Cubist <- train(x = data_train_X,
y = data_train_Y,
method = "cubist",
trControl = trainControl(method = 'cv'))
TreeBased_Cubist
## Cubist
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1286602 0.4755080 0.08985362
## 1 5 0.1239531 0.5287986 0.08520446
## 1 9 0.1236001 0.5245257 0.08516113
## 10 0 0.1117588 0.5832072 0.08091727
## 10 5 0.1054796 0.6275221 0.07473292
## 10 9 0.1054210 0.6270252 0.07520324
## 20 0 0.1107426 0.5919454 0.08024516
## 20 5 0.1042786 0.6350231 0.07382106
## 20 9 0.1043447 0.6342982 0.07434306
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
TreeBased_Cubist_Pred <- predict(TreeBased_Cubist, newdata = data_test_X)
TreeBased_Cubist_metrics <- postResample(pred = TreeBased_Cubist_Pred, obs = data_test_Y)
TreeBased_Cubist_metrics
## RMSE Rsquared MAE
## 0.09987318 0.67114775 0.07325504
7 Model Selection
The SVM-Radial model has both the lowest RMSE and the highest R2 in the comparison below, and is therefore selected as the final model.
rbind(Linear_PLS_metrics,
Linear_Ridge_metrics,
Linear_LASSO_metrics,
Linear_eNet_metrics,
NonLinear_KNN_metrics,
NonLinear_SVMLinear_metrics,
NonLinear_SVMRadial_metrics,
NonLinear_MARS_metrics,
NonLinear_NNet_metrics,
TreeBased_RF_metrics,
TreeBased_GBM_metrics,
TreeBased_Cubist_metrics
) %>%
data.frame() %>%
arrange(RMSE)
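Finally, to produce the Excel prediction file requested in the introduction, a minimal sketch is shown below. It assumes the writexl package is available and that the selected SVM-Radial model and the preprocessed evaluation predictors from Section 5 are still in the session; the output file name is illustrative.
# Predict PH for the evaluation set with the selected SVM-Radial model
eval_PH_pred <- predict(NonLinear_SVMRadial, newdata = data_eval_X)
# Attach the predictions to the original evaluation data and export to Excel
df_eval %>%
  mutate(PH = eval_PH_pred) %>%
  writexl::write_xlsx("StudentEvaluation_PH_predictions.xlsx")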