DATA 624 - PROJECT 2
1 Introduction
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the Excel file showing the predictions of your models for pH.
2 Load Packages
The following R packages are used in this project.
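A representative package-loading chunk is sketched below; the exact list in the original .Rmd may differ slightly, but these are the packages whose functions appear in the code that follows.
library(tidyverse)  # dplyr/magrittr pipelines used throughout
library(rio)        # import() for reading the .xlsx files from GitHub
library(skimr)      # skim() data summaries
library(VIM)        # kNN() imputation
library(caret)      # train(), trainControl(), findCorrelation(), postResample()
library(recipes)    # preprocessing recipe steps
library(rsample)    # initial_split(), training(), testing()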
3 Load Data
Two data sets are downloaded from GitHub:
- Training Data: StudentData.xlsx
- Evaluation Data: StudentEvaluation.xlsx
df<-rio::import('https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project2/StudentData.xlsx')
df_eval <-rio::import('https://raw.githubusercontent.com/shirley-wong/Data-624/main/Project2/StudentEvaluation.xlsx')
df<-data.frame(df)
df_eval<-data.frame(df_eval)
head(df)
4 Exploratory data analysis
According to the data summary below:
- The response variable [PH] is continuous, therefore a regression model is to be built.
- There are 31 numerical predictors and 1 categorical predictor in the data set.
- Only about 1% of the data are missing overall. The predictor containing the most missing values is [MFR], with a missing ratio of 212/2571 = 8.25% (a quick numerical check follows this list). Therefore no predictor is suggested to be removed, and imputation is included in the later data preprocessing.
- There are 4 rows in the training set in which [PH] is missing. As imputing the response variable is not meaningful in the training set, these 4 rows are removed.
- The majority of the continuous numerical predictors in both the training and evaluation sets show skewed distributions, and some predictors contain negative values, therefore the Yeo-Johnson transformation is used to remove the skewness.
- Dummy variables will be created for the categorical predictor [Brand.Code].
- After missing value imputation, the predictors [Balling], [Hyd.Pressure3], [Density], [Balling.Lvl] and [Filler.Level] have pairwise correlations greater than 0.9 with other predictors, therefore they are suggested to be removed to avoid multicollinearity.
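A quick numerical check of the missingness figures above (a sketch, assuming df has been loaded as in Section 3):
mean(is.na(df))                # overall proportion of missing cells, roughly 1%
sum(is.na(df$MFR))             # 212 missing values in MFR
sum(is.na(df$MFR)) / nrow(df)  # 212/2571 = 8.25%
sum(is.na(df$PH))              # 4 rows with missing PH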
4.1 Training Data Summary
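The training data summary below is produced with the skimr package; the call is sketched here for completeness, assuming skimr is loaded.
skim(df)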
| Name | df |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb.Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC.CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf.Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill.Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd.Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd.Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler.Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler.Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb.Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch.Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb.Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling.Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
4.2 Evaluation Data Summary
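As above, the evaluation data summary below is produced with skimr (sketched call):
skim(df_eval)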
| Name | df_eval |
| Number of rows | 267 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| PH | 267 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 1 | 1.00 | 5.37 | 0.11 | 5.15 | 5.29 | 5.34 | 5.47 | 5.67 | ▂▇▃▅▁ |
| Fill.Ounces | 6 | 0.98 | 23.97 | 0.08 | 23.75 | 23.92 | 23.97 | 24.01 | 24.20 | ▁▅▇▃▁ |
| PC.Volume | 4 | 0.99 | 0.28 | 0.06 | 0.10 | 0.23 | 0.28 | 0.32 | 0.46 | ▁▆▇▅▁ |
| Carb.Pressure | 0 | 1.00 | 68.25 | 3.86 | 60.20 | 65.30 | 68.00 | 70.60 | 77.60 | ▃▆▇▃▂ |
| Carb.Temp | 1 | 1.00 | 141.23 | 4.30 | 130.00 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▆▇▃▁ |
| PSC | 5 | 0.98 | 0.09 | 0.05 | 0.00 | 0.04 | 0.08 | 0.11 | 0.25 | ▆▇▃▂▁ |
| PSC.Fill | 3 | 0.99 | 0.19 | 0.11 | 0.02 | 0.10 | 0.18 | 0.26 | 0.62 | ▇▇▃▁▁ |
| PSC.CO2 | 5 | 0.98 | 0.05 | 0.04 | 0.00 | 0.02 | 0.04 | 0.06 | 0.24 | ▇▃▂▁▁ |
| Mnf.Flow | 0 | 1.00 | 21.03 | 117.76 | -100.20 | -100.00 | 0.20 | 141.30 | 220.40 | ▇▁▁▆▂ |
| Carb.Pressure1 | 4 | 0.99 | 123.04 | 4.42 | 113.00 | 120.20 | 123.40 | 125.50 | 136.00 | ▃▃▇▂▁ |
| Fill.Pressure | 2 | 0.99 | 48.14 | 3.44 | 37.80 | 46.00 | 47.80 | 50.20 | 60.20 | ▁▇▇▂▁ |
| Hyd.Pressure1 | 0 | 1.00 | 12.01 | 13.53 | -50.00 | 0.00 | 10.40 | 20.40 | 50.00 | ▁▁▇▆▂ |
| Hyd.Pressure2 | 1 | 1.00 | 20.11 | 17.21 | -50.00 | 0.00 | 26.80 | 34.80 | 61.40 | ▁▁▆▇▁ |
| Hyd.Pressure3 | 1 | 1.00 | 19.61 | 16.56 | -50.00 | 0.00 | 27.70 | 33.00 | 49.20 | ▁▁▆▃▇ |
| Hyd.Pressure4 | 4 | 0.99 | 97.84 | 13.92 | 68.00 | 90.00 | 98.00 | 104.00 | 140.00 | ▅▆▇▂▁ |
| Filler.Level | 2 | 0.99 | 110.29 | 15.50 | 69.20 | 100.60 | 118.60 | 120.20 | 153.20 | ▂▃▇▇▁ |
| Filler.Speed | 10 | 0.96 | 3581.39 | 911.19 | 1006.00 | 3812.00 | 3978.00 | 3996.00 | 4020.00 | ▁▁▁▁▇ |
| Temperature | 2 | 0.99 | 66.23 | 1.69 | 63.80 | 65.40 | 65.80 | 66.60 | 75.40 | ▇▅▁▁▁ |
| Usage.cont | 2 | 0.99 | 20.90 | 3.00 | 12.90 | 18.12 | 21.44 | 23.74 | 24.60 | ▁▃▃▃▇ |
| Carb.Flow | 0 | 1.00 | 2408.64 | 1161.36 | 0.00 | 1083.00 | 3038.00 | 3215.00 | 3858.00 | ▂▃▁▆▇ |
| Density | 1 | 1.00 | 1.18 | 0.38 | 0.06 | 0.92 | 0.98 | 1.60 | 1.84 | ▁▁▇▁▅ |
| MFR | 31 | 0.88 | 697.80 | 96.40 | 15.60 | 707.00 | 724.60 | 731.45 | 784.80 | ▁▁▁▁▇ |
| Balling | 1 | 1.00 | 2.20 | 0.92 | 0.90 | 1.50 | 1.65 | 3.24 | 3.79 | ▅▇▁▂▅ |
| Pressure.Vacuum | 1 | 1.00 | -5.17 | 0.58 | -6.40 | -5.60 | -5.20 | -4.80 | -3.60 | ▁▇▆▃▁ |
| Oxygen.Filler | 3 | 0.99 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.05 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 1 | 1.00 | 109.62 | 15.02 | 70.00 | 100.00 | 120.00 | 120.00 | 130.00 | ▁▂▁▃▇ |
| Pressure.Setpoint | 2 | 0.99 | 47.73 | 2.06 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 1 | 1.00 | 142.83 | 1.23 | 141.20 | 142.20 | 142.60 | 142.80 | 147.20 | ▅▇▁▁▁ |
| Alch.Rel | 3 | 0.99 | 6.91 | 0.50 | 6.40 | 6.54 | 6.58 | 7.18 | 7.82 | ▇▁▂▁▃ |
| Carb.Rel | 2 | 0.99 | 5.44 | 0.13 | 5.18 | 5.34 | 5.40 | 5.56 | 5.74 | ▂▇▂▃▂ |
| Balling.Lvl | 0 | 1.00 | 2.05 | 0.88 | 0.00 | 1.38 | 1.48 | 3.08 | 3.42 | ▁▃▇▁▇ |
4.4 Numerical Predictor Correlation after Missing Data Imputation
- Use KNN to impute missing values in the training data set.
- Compute pairwise correlations and locate the predictors with pairwise correlation greater than 0.9.
findCorrelation(df %>%
kNN() %>%
select(!ends_with('imp'), -c(Brand.Code, PH)) %>%
cor(),
cutoff = 0.9,
names = TRUE,
verbose = TRUE)
## Warning in gowerD(don_dist_var, imp_dist_var, weights = weightsx,
## numericalX, : NAs introduced by coercion
## Compare row 23 and column 21 with corr 0.955
## Means: 0.248 vs 0.154 so flagging column 23
## Compare row 14 and column 13 with corr 0.925
## Means: 0.246 vs 0.147 so flagging column 14
## Compare row 21 and column 31 with corr 0.948
## Means: 0.21 vs 0.141 so flagging column 21
## Compare row 31 and column 29 with corr 0.921
## Means: 0.18 vs 0.136 so flagging column 31
## Compare row 16 and column 26 with corr 0.946
## Means: 0.189 vs 0.133 so flagging column 16
## All correlations <= 0.9
## [1] "Balling" "Hyd.Pressure3" "Density" "Balling.Lvl"
## [5] "Filler.Level"
5 Data Preprocess
For the training set:
- Remove rows where PH is missing (NA).
- Perform a train/test split with ratio 4/5.
For both the training and evaluation sets:
- Impute missing values using bagged trees
- Create dummy variables for categorical variables
- Center and scale numerical variables
- Remove skewness of numerical variables
- Remove predictors with near-zero variance
- Remove predictors with correlation greater than 0.9
Note: Data preprocessing can be performed during model training; however, as there are multiple models to be built in the later section, preprocessing the data in advance is more efficient than doing it during each model run.
set.seed(0)
# -- remove is.na(PH)
df <- df %>%
filter(!is.na(PH))
# -- data preprocess
data_prepProc <- recipe(PH ~ ., df) %>%
#Impute missing value
step_bagimpute(all_predictors()) %>%
# create dummy variable for categorical variables
step_dummy(all_nominal(), -all_outcomes()) %>%
# center and scale
step_normalize(all_numeric(), -all_outcomes()) %>%
# remove skewness
step_YeoJohnson(all_nominal(), -all_outcomes()) %>%
# remove near zero variance predictors
step_nzv(all_nominal(), -all_outcomes()) %>%
# remove predictors with correlation > 0.9
step_corr(all_numeric(), -all_outcomes()) %>%
prep()
df_mod <- data_prepProc %>%
bake(df)
df_eval_mod <- data_prepProc %>%
bake(df_eval)
# train-test-split
df_split <- df_mod %>% initial_split(prop = 4/5)
# Training set
data_train_X <- training(df_split) %>% select(-PH)
data_train_Y <- training(df_split) %>% .$PH
# Testing set
data_test_X <- testing(df_split) %>% select(-PH)
data_test_Y <- testing(df_split) %>% .$PH
# Evaluation Set
data_eval_X <- df_eval_mod %>% select(-PH)
skim(df_mod)
| Name | df_mod |
| Number of rows | 2567 |
| Number of columns | 29 |
| _______________________ | |
| Column type frequency: | |
| numeric | 29 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 0 | 1 | 0.00 | 1.00 | -3.11 | -0.72 | -0.22 | 0.81 | 3.10 | ▁▆▇▅▁ |
| Fill.Ounces | 0 | 1 | 0.00 | 1.00 | -3.93 | -0.63 | -0.02 | 0.60 | 3.97 | ▁▂▇▂▁ |
| PC.Volume | 0 | 1 | 0.00 | 1.00 | -3.28 | -0.63 | -0.10 | 0.58 | 3.31 | ▁▃▇▂▁ |
| Carb.Pressure | 0 | 1 | 0.00 | 1.00 | -3.16 | -0.74 | 0.00 | 0.67 | 3.15 | ▁▅▇▃▁ |
| Carb.Temp | 0 | 1 | 0.00 | 1.00 | -3.08 | -0.67 | -0.08 | 0.66 | 3.17 | ▁▅▇▃▁ |
| PSC | 0 | 1 | 0.00 | 1.00 | -1.69 | -0.71 | -0.14 | 0.56 | 3.78 | ▆▇▃▁▁ |
| PSC.Fill | 0 | 1 | 0.00 | 1.00 | -1.67 | -0.81 | -0.13 | 0.55 | 3.62 | ▆▇▃▁▁ |
| PSC.CO2 | 0 | 1 | 0.00 | 1.00 | -1.32 | -0.85 | -0.39 | 0.55 | 4.29 | ▇▅▂▁▁ |
| Mnf.Flow | 0 | 1 | 0.00 | 1.00 | -1.04 | -1.04 | 0.38 | 0.97 | 1.71 | ▇▁▁▇▂ |
| Carb.Pressure1 | 0 | 1 | 0.00 | 1.00 | -3.60 | -0.76 | 0.14 | 0.60 | 3.75 | ▁▃▇▂▁ |
| Fill.Pressure | 0 | 1 | 0.00 | 1.00 | -4.19 | -0.60 | -0.48 | 0.66 | 3.93 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 0 | 1 | 0.00 | 1.00 | -1.06 | -1.00 | -0.08 | 0.63 | 3.67 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 0 | 1 | 0.00 | 1.00 | -1.27 | -1.27 | 0.47 | 0.84 | 2.35 | ▇▂▇▅▁ |
| Hyd.Pressure4 | 0 | 1 | 0.00 | 1.00 | -2.62 | -0.80 | -0.04 | 0.42 | 3.46 | ▂▆▇▂▁ |
| Temperature | 0 | 1 | 0.00 | 1.00 | -1.71 | -0.56 | -0.27 | 0.30 | 7.34 | ▇▃▁▁▁ |
| Usage.cont | 0 | 1 | 0.00 | 1.00 | -3.00 | -0.88 | 0.26 | 0.92 | 1.65 | ▁▃▅▃▇ |
| Carb.Flow | 0 | 1 | 0.00 | 1.00 | -2.29 | -1.22 | 0.52 | 0.67 | 2.46 | ▂▅▆▇▁ |
| MFR | 0 | 1 | 0.00 | 1.00 | -5.13 | 0.17 | 0.38 | 0.45 | 1.55 | ▁▁▁▂▇ |
| Pressure.Vacuum | 0 | 1 | 0.00 | 1.00 | -2.43 | -0.67 | -0.32 | 0.38 | 2.83 | ▂▇▅▃▁ |
| Oxygen.Filler | 0 | 1 | 0.00 | 1.00 | -0.98 | -0.55 | -0.29 | 0.30 | 7.83 | ▇▁▁▁▁ |
| Bowl.Setpoint | 0 | 1 | 0.00 | 1.00 | -2.57 | -0.61 | 0.70 | 0.70 | 2.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 0 | 1 | 0.00 | 1.00 | -1.77 | -0.79 | -0.79 | 1.17 | 2.16 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1 | 0.00 | 1.00 | -1.68 | -0.52 | -0.19 | 0.14 | 4.42 | ▅▇▁▁▁ |
| Alch.Rel | 0 | 1 | 0.00 | 1.00 | -3.20 | -0.71 | -0.67 | 0.66 | 3.41 | ▁▇▂▃▁ |
| Carb.Rel | 0 | 1 | 0.00 | 1.00 | -3.70 | -0.75 | -0.28 | 0.80 | 4.85 | ▁▇▆▂▁ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Brand.Code_B | 0 | 1 | 0.00 | 1.00 | -1.00 | -1.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| Brand.Code_C | 0 | 1 | 0.00 | 1.00 | -0.41 | -0.41 | -0.41 | -0.41 | 2.43 | ▇▁▁▁▂ |
| Brand.Code_D | 0 | 1 | 0.00 | 1.00 | -0.56 | -0.56 | -0.56 | -0.56 | 1.78 | ▇▁▁▁▂ |
6 Model building
Three categories of regression models are to be built in this section: Linear Regression Models, Non-linear Regression Models and Tree-based Models. The model with the best performance on the test data set will be selected as the final model.
The models to be built are as below:
- Linear Regression Models:
PLS,Ridge,LASSOandElastic Net - Non-linear Regression Models:
KNN,SVM-Linear,SVM-Radial,MARSandNeural Network - Tree-based Regression Models:
Random Forest,Gradient Boosting MachineandCubist
6.1 Linear Regression Models
6.1.1 PLS Regression
7 latent variables (ncomp = 7) are optimal.
The corresponding training-set RMSE and R2 are 0.1362656 and 0.3739715 respectively.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
Linear_PLS <- train(data_train_X, data_train_Y,
method = 'pls',
tuneLength = 20,
trControl = ctrl)
Linear_PLS
## Partial Least Squares
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1497005 0.2470540 0.1176600
## 2 0.1430215 0.3139339 0.1116965
## 3 0.1413576 0.3297154 0.1108805
## 4 0.1396517 0.3458175 0.1093216
## 5 0.1390031 0.3516492 0.1085059
## 6 0.1384918 0.3566973 0.1080004
## 7 0.1384305 0.3573092 0.1081537
## 8 0.1384597 0.3570316 0.1080082
## 9 0.1385041 0.3566531 0.1080056
## 10 0.1385358 0.3563692 0.1080224
## 11 0.1385680 0.3560643 0.1080587
## 12 0.1385836 0.3559539 0.1080834
## 13 0.1385914 0.3558839 0.1080780
## 14 0.1385636 0.3561045 0.1080470
## 15 0.1385706 0.3560451 0.1080609
## 16 0.1385782 0.3559804 0.1080730
## 17 0.1385796 0.3559731 0.1080733
## 18 0.1385952 0.3558288 0.1080827
## 19 0.1386021 0.3557690 0.1080826
## 20 0.1386004 0.3557876 0.1080814
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 7.
Linear_PLS_pred <- predict(Linear_PLS, data_train_X)
Linear_PLS_metrics <- postResample(Linear_PLS_pred, data_train_Y)
Linear_PLS_metrics
## RMSE Rsquared MAE
## 0.1362656 0.3739715 0.1064367
6.1.2 Ridge Regression
lambda = 0.03157895 is optimal.
The corresponding test-set RMSE and R2 are 0.1299868 and 0.4415918 respectively.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
ridgeGrid <- data.frame(.lambda = seq(0, .2, length = 20))
Linear_Ridge <- train(data_train_X, data_train_Y,
method = 'ridge',
tuneGrid = ridgeGrid,
trControl = ctrl)
Linear_Ridge
## Ridge Regression
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00000000 0.1386059 0.3557400 0.1080834
## 0.01052632 0.1385244 0.3564372 0.1080449
## 0.02105263 0.1384937 0.3566985 0.1080301
## 0.03157895 0.1384906 0.3567237 0.1080267
## 0.04210526 0.1385055 0.3565978 0.1080296
## 0.05263158 0.1385331 0.3563667 0.1080395
## 0.06315789 0.1385701 0.3560587 0.1080534
## 0.07368421 0.1386146 0.3556927 0.1080730
## 0.08421053 0.1386650 0.3552820 0.1081018
## 0.09473684 0.1387202 0.3548367 0.1081358
## 0.10526316 0.1387795 0.3543642 0.1081742
## 0.11578947 0.1388424 0.3538704 0.1082172
## 0.12631579 0.1389082 0.3533599 0.1082615
## 0.13684211 0.1389767 0.3528364 0.1083094
## 0.14736842 0.1390475 0.3523031 0.1083598
## 0.15789474 0.1391204 0.3517622 0.1084124
## 0.16842105 0.1391953 0.3512160 0.1084677
## 0.17894737 0.1392719 0.3506661 0.1085256
## 0.18947368 0.1393501 0.3501139 0.1085829
## 0.20000000 0.1394298 0.3495607 0.1086415
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.03157895.
Linear_Ridge_Pred <- predict(Linear_Ridge, newdata = data_test_X)
Linear_Ridge_metrics <- postResample(pred = Linear_Ridge_Pred, obs = data_test_Y)
Linear_Ridge_metrics
## RMSE Rsquared MAE
## 0.1299868 0.4415918 0.1021300
6.1.3 LASSO
The optimal fraction is 0.1.
The corresponding test-set RMSE and R2 are 0.1561285 and 0.2961838 respectively.
set.seed(0)
lassoGrid <- data.frame(.fraction = seq(0.01, .1, length = 20))
Linear_LASSO <- train(data_train_X, data_train_Y,
method = 'lasso',
tuneGrid = lassoGrid,
trControl = ctrl)
Linear_LASSO
## The lasso
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.01000000 0.1702219 0.1939752 0.1358928
## 0.01473684 0.1693460 0.1939752 0.1350627
## 0.01947368 0.1684926 0.1939752 0.1342976
## 0.02421053 0.1676620 0.1939752 0.1335504
## 0.02894737 0.1668545 0.1939752 0.1328056
## 0.03368421 0.1660705 0.1939752 0.1321218
## 0.03842105 0.1653103 0.1939752 0.1314585
## 0.04315789 0.1645743 0.1939752 0.1308063
## 0.04789474 0.1638627 0.1939752 0.1301621
## 0.05263158 0.1631759 0.1939752 0.1295284
## 0.05736842 0.1625142 0.1939752 0.1289066
## 0.06210526 0.1619037 0.1954704 0.1283338
## 0.06684211 0.1613301 0.1989758 0.1277900
## 0.07157895 0.1607555 0.2047889 0.1272578
## 0.07631579 0.1601689 0.2114635 0.1267473
## 0.08105263 0.1595945 0.2174196 0.1262680
## 0.08578947 0.1590325 0.2227269 0.1257959
## 0.09052632 0.1584829 0.2274517 0.1253301
## 0.09526316 0.1579511 0.2316237 0.1248748
## 0.10000000 0.1574345 0.2354022 0.1244261
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.1.
Linear_LASSO_Pred <- predict(Linear_LASSO, newdata = data_test_X)
Linear_LASSO_metrics <- postResample(pred = Linear_LASSO_Pred, obs = data_test_Y)
Linear_LASSO_metrics
## RMSE Rsquared MAE
## 0.1561285 0.2961838 0.1274395
6.1.4 Elastic Net
The optimal values are fraction = 0.1 and lambda = 0.2.
The corresponding test-set RMSE and R2 are 0.1589297 and 0.2697740 respectively.
set.seed(0)
enetGrid <- data.frame(.lambda = seq(0, .2, length = 20),
.fraction = seq(0.01, .1, length = 20))
Linear_eNet <- train(data_train_X, data_train_Y,
method = 'enet',
tuneGrid = enetGrid,
trControl = ctrl)
Linear_eNet
## Elasticnet
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## lambda fraction RMSE Rsquared MAE
## 0.00000000 0.01000000 0.1702219 0.1939752 0.1358928
## 0.01052632 0.01473684 0.1694478 0.1939752 0.1351553
## 0.02105263 0.01947368 0.1687105 0.1939752 0.1344917
## 0.03157895 0.02421053 0.1680056 0.1939752 0.1338616
## 0.04210526 0.02894737 0.1673297 0.1939752 0.1332466
## 0.05263158 0.03368421 0.1666792 0.1939752 0.1326442
## 0.06315789 0.03842105 0.1660529 0.1939752 0.1321087
## 0.07368421 0.04315789 0.1654493 0.1939752 0.1315835
## 0.08421053 0.04789474 0.1648661 0.1939752 0.1310710
## 0.09473684 0.05263158 0.1643026 0.1939752 0.1305686
## 0.10526316 0.05736842 0.1637586 0.1939752 0.1300747
## 0.11578947 0.06210526 0.1632339 0.1939752 0.1295900
## 0.12631579 0.06684211 0.1627265 0.1939752 0.1291161
## 0.13684211 0.07157895 0.1622458 0.1942561 0.1286614
## 0.14736842 0.07631579 0.1618001 0.1953714 0.1282402
## 0.15789474 0.08105263 0.1613795 0.1982185 0.1278414
## 0.16842105 0.08578947 0.1609593 0.2025491 0.1274507
## 0.17894737 0.09052632 0.1605323 0.2074739 0.1270651
## 0.18947368 0.09526316 0.1601179 0.2122214 0.1266996
## 0.20000000 0.10000000 0.1597176 0.2167320 0.1263679
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.1 and lambda = 0.2.
Linear_eNet_Pred <- predict(Linear_eNet, newdata = data_test_X)
Linear_eNet_metrics <- postResample(pred = Linear_eNet_Pred, obs = data_test_Y)
Linear_eNet_metrics
## RMSE Rsquared MAE
## 0.1589297 0.2697740 0.1299668
6.2 Non-Linear Regression Models
6.2.1 KNN
The optimal k is 7.
The corresponding training-set RMSE and R2 are 0.10585060 and 0.62857413 respectively.
set.seed(0)
NonLinear_KNN <- train(data_train_X, data_train_Y,
method = 'knn',
tuneLength = 10,
trControl = ctrl)
NonLinear_KNN
## k-Nearest Neighbors
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1257029 0.4775757 0.09351756
## 7 0.1237475 0.4906292 0.09276375
## 9 0.1242006 0.4868828 0.09366748
## 11 0.1258387 0.4745378 0.09549822
## 13 0.1263061 0.4712000 0.09587242
## 15 0.1274855 0.4620434 0.09716663
## 17 0.1284044 0.4544409 0.09826715
## 19 0.1287749 0.4513034 0.09857713
## 21 0.1292793 0.4471276 0.09919209
## 23 0.1298352 0.4422596 0.09962648
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
NonLinear_KNN_pred <- predict(NonLinear_KNN, data_train_X)
NonLinear_KNN_metrics <- postResample(NonLinear_KNN_pred, data_train_Y)
NonLinear_KNN_metrics
## RMSE Rsquared MAE
## 0.10585060 0.62857413 0.07894874
6.2.2 SVM-Linear
The optimal parameters are epsilon = 0.1 and cost C = 1.
The corresponding training-set RMSE and R2 are 0.1381481 and 0.3615830 respectively.
set.seed(0)
NonLinear_SVMLinear <- train(data_train_X, data_train_Y,
method = 'svmLinear',
tuneLength = 15,
trControl = ctrl)
NonLinear_SVMLinear
## Support Vector Machines with Linear Kernel
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1405161 0.3452223 0.1072494
##
## Tuning parameter 'C' was held constant at a value of 1
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 1831
##
## Objective Function Value : -1053.426
## Training error : 0.643132
NonLinear_SVMLinear_pred <- predict(NonLinear_SVMLinear, data_train_X)
NonLinear_SVMLinear_metrics <- postResample(NonLinear_SVMLinear_pred, data_train_Y)
NonLinear_SVMLinear_metrics
## RMSE Rsquared MAE
## 0.1381481 0.3615830 0.1045695
6.2.3 SVM-Radial
The optimal parameters are sigma = 0.0242724 and C = 4.
The corresponding training-set RMSE and R2 are 0.08011998 and 0.79263724 respectively.
set.seed(0)
NonLinear_SVMRadial <- train(data_train_X, data_train_Y,
method = 'svmRadial',
tuneLength = 15,
trControl = ctrl)
NonLinear_SVMRadial
## Support Vector Machines with Radial Basis Function Kernel
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1286431 0.4526820 0.09577483
## 0.50 0.1256923 0.4758057 0.09278004
## 1.00 0.1231104 0.4952829 0.09035109
## 2.00 0.1210941 0.5106732 0.08867772
## 4.00 0.1204826 0.5158644 0.08851988
## 8.00 0.1212283 0.5141755 0.08924725
## 16.00 0.1224728 0.5116971 0.09033769
## 32.00 0.1258334 0.4986777 0.09296903
## 64.00 0.1326503 0.4687005 0.09806454
## 128.00 0.1389973 0.4449388 0.10296902
## 256.00 0.1452495 0.4218464 0.10818173
## 512.00 0.1510565 0.4016687 0.11316640
## 1024.00 0.1519305 0.3984248 0.11383537
## 2048.00 0.1519305 0.3984248 0.11383537
## 4096.00 0.1519305 0.3984248 0.11383537
##
## Tuning parameter 'sigma' was held constant at a value of 0.0242724
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0242724 and C = 4.
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 4
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0242723997688406
##
## Number of Support Vectors : 1748
##
## Objective Function Value : -2289.491
## Training error : 0.216318
NonLinear_SVMRadial_pred <- predict(NonLinear_SVMRadial, data_train_X)
NonLinear_SVMRadial_metrics <- postResample(NonLinear_SVMRadial_pred, data_train_Y)
NonLinear_SVMRadial_metrics
## RMSE Rsquared MAE
## 0.08011998 0.79263724 0.05028598
6.2.4 MARS
The optimal values are nprune = 23 and degree = 2.
The corresponding test-set RMSE and R2 are 0.12396741 and 0.49036903 respectively.
set.seed(0)
NonLinear_MARS <- train(data_train_X, data_train_Y,
method ='earth',
tuneGrid = expand.grid(.degree = 1:2,
.nprune = 2:38),
trControl = ctrl)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
NonLinear_MARS
## Multivariate Adaptive Regression Spline
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1527874 0.2164850 0.11922540
## 1 3 0.1457986 0.2863438 0.11355089
## 1 4 0.1452585 0.2918423 0.11308342
## 1 5 0.1441796 0.3026028 0.11197668
## 1 6 0.1415169 0.3270726 0.10987591
## 1 7 0.1404233 0.3375773 0.10888650
## 1 8 0.1393220 0.3479701 0.10820117
## 1 9 0.1376909 0.3630413 0.10675904
## 1 10 0.1361410 0.3783478 0.10524108
## 1 11 0.1359734 0.3793368 0.10491628
## 1 12 0.1356803 0.3821798 0.10448616
## 1 13 0.1371386 0.3722615 0.10517051
## 1 14 0.1371648 0.3721012 0.10514100
## 1 15 0.1376772 0.3688911 0.10526938
## 1 16 0.1377746 0.3686955 0.10511907
## 1 17 0.1377336 0.3690273 0.10493933
## 1 18 0.1376191 0.3702063 0.10482253
## 1 19 0.1376361 0.3699936 0.10482417
## 1 20 0.1377596 0.3690091 0.10470162
## 1 21 0.1376697 0.3698763 0.10467845
## 1 22 0.1372426 0.3737903 0.10428601
## 1 23 0.1373113 0.3733069 0.10431459
## 1 24 0.1371606 0.3744426 0.10429175
## 1 25 0.1369533 0.3761997 0.10418090
## 1 26 0.1366941 0.3786725 0.10387195
## 1 27 0.1368112 0.3779475 0.10391476
## 1 28 0.1367762 0.3786085 0.10390836
## 1 29 0.1366418 0.3798313 0.10380135
## 1 30 0.1364046 0.3818173 0.10374862
## 1 31 0.1366823 0.3794642 0.10376088
## 1 32 0.1367684 0.3788260 0.10381211
## 1 33 0.1371391 0.3760662 0.10401979
## 1 34 0.1371362 0.3761394 0.10397847
## 1 35 0.1373108 0.3745209 0.10399745
## 1 36 0.1374006 0.3739643 0.10405588
## 1 37 0.1374042 0.3738738 0.10411443
## 1 38 0.1374042 0.3738738 0.10411443
## 2 2 0.1527874 0.2164850 0.11922540
## 2 3 0.1461656 0.2830313 0.11377071
## 2 4 0.1448162 0.2964531 0.11218248
## 2 5 0.1432118 0.3120895 0.11125386
## 2 6 0.1413509 0.3291617 0.10987207
## 2 7 0.1399609 0.3424904 0.10835457
## 2 8 0.1392349 0.3520761 0.10718962
## 2 9 0.1356488 0.3813322 0.10373772
## 2 10 0.1362731 0.3780690 0.10382089
## 2 11 0.1365127 0.3769165 0.10333758
## 2 12 0.1364264 0.3778924 0.10319761
## 2 13 0.1351820 0.3883888 0.10240425
## 2 14 0.1355442 0.3855695 0.10251147
## 2 15 0.1346713 0.3928119 0.10169420
## 2 16 0.1337491 0.4010124 0.10125582
## 2 17 0.1337139 0.4018183 0.10135034
## 2 18 0.1330777 0.4078082 0.10074254
## 2 19 0.1330450 0.4084463 0.10053691
## 2 20 0.1327534 0.4110547 0.10032196
## 2 21 0.1325528 0.4127307 0.10017536
## 2 22 0.1321892 0.4157352 0.09981311
## 2 23 0.1318251 0.4188860 0.09946667
## 2 24 0.1318599 0.4185769 0.09949694
## 2 25 0.1320799 0.4167887 0.09960169
## 2 26 0.1321709 0.4160313 0.09959809
## 2 27 0.1320612 0.4169283 0.09951555
## 2 28 0.1320617 0.4168964 0.09952595
## 2 29 0.1320048 0.4173713 0.09945129
## 2 30 0.1320310 0.4171650 0.09951915
## 2 31 0.1320310 0.4171650 0.09951915
## 2 32 0.1320310 0.4171650 0.09951915
## 2 33 0.1320310 0.4171650 0.09951915
## 2 34 0.1320310 0.4171650 0.09951915
## 2 35 0.1320310 0.4171650 0.09951915
## 2 36 0.1320310 0.4171650 0.09951915
## 2 37 0.1320310 0.4171650 0.09951915
## 2 38 0.1320310 0.4171650 0.09951915
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 23 and degree = 2.
## Selected 23 of 29 terms, and 11 of 28 predictors (nprune=23)
## Termination condition: RSq changed by less than 0.001 at 29 terms
## Importance: Mnf.Flow, Brand.Code_C, Alch.Rel, Bowl.Setpoint, ...
## Number of terms at each degree of interaction: 1 5 17
## GCV 0.01611237 RSS 31.31483 GRSq 0.4573021 RSq 0.4859905
NonLinear_MARS_Pred <- predict(NonLinear_MARS, newdata = data_test_X)
NonLinear_MARS_metrics <- postResample(pred = NonLinear_MARS_Pred, obs = data_test_Y)
NonLinear_MARS_metrics
## RMSE Rsquared MAE
## 0.12396741 0.49036903 0.09564496
6.2.5 Neural Network
The final neural network model has size = 5 and decay = 0.01, with test-set RMSE and R2 of 0.11423783 and 0.56938536 respectively.
set.seed(0)
NonLinear_NNet <- train(data_train_X, data_train_Y,
method ='avNNet',
tuneGrid = expand.grid(.decay = seq(0.01,0.1,0.02),
.size = c(1:5),
.bag = FALSE),
trControl = trainControl(method = "cv"),
trace = FALSE,
linout =TRUE#,
#MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
#maxit = 500
)
NonLinear_NNet
## Model Averaged Neural Network
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.01 1 0.1390464 0.3530330 0.10718967
## 0.01 2 0.1434126 0.3389881 0.10782088
## 0.01 3 0.1526538 0.3942120 0.10075708
## 0.01 4 0.1257428 0.4693378 0.09528213
## 0.01 5 0.1233552 0.4889622 0.09328839
## 0.03 1 0.1386663 0.3554992 0.10775591
## 0.03 2 0.1388569 0.3613455 0.10756416
## 0.03 3 0.1315032 0.4201018 0.10026793
## 0.03 4 0.1258351 0.4704857 0.09536096
## 0.03 5 0.1247329 0.4793170 0.09451236
## 0.05 1 0.1384113 0.3582842 0.10760727
## 0.05 2 0.1418169 0.3406693 0.10996645
## 0.05 3 0.1307451 0.4301629 0.09983160
## 0.05 4 0.1269153 0.4592491 0.09681098
## 0.05 5 0.1243520 0.4819411 0.09394431
## 0.07 1 0.1387791 0.3543691 0.10793027
## 0.07 2 0.1433552 0.3242280 0.11137415
## 0.07 3 0.1307574 0.4302750 0.10031547
## 0.07 4 0.1275863 0.4536516 0.09767131
## 0.07 5 0.1249580 0.4762182 0.09519004
## 0.09 1 0.1384934 0.3572619 0.10786324
## 0.09 2 0.1388061 0.3677063 0.10781687
## 0.09 3 0.1297552 0.4408124 0.09960522
## 0.09 4 0.1263112 0.4645131 0.09599997
## 0.09 5 0.1251908 0.4737351 0.09525916
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag
## = FALSE.
NonLinear_NNet_Pred <- predict(NonLinear_NNet, newdata = data_test_X)
NonLinear_NNet_metrics <- postResample(pred = NonLinear_NNet_Pred, obs = data_test_Y)
NonLinear_NNet_metrics
## RMSE Rsquared MAE
## 0.11423783 0.56938536 0.08687277
6.3 Tree-Based Regression Models
6.3.1 Random Forest
The optimal mtry = 15.
The corresponding test-set RMSE and R2 are 0.09784328 and 0.69226170 respectively.
set.seed(0)
TreeBased_RF <- train(x = data_train_X,
y = data_train_Y,
method = "rf",
trControl = ctrl)
TreeBased_RF
## Random Forest
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.1165576 0.5859532 0.08864558
## 15 0.1046622 0.6441878 0.07596282
## 28 0.1054225 0.6312982 0.07518499
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 15.
TreeBased_RF_Pred <- predict(TreeBased_RF, newdata = data_test_X)
TreeBased_RF_metrics <- postResample(pred = TreeBased_RF_Pred, obs = data_test_Y)
TreeBased_RF_metrics
## RMSE Rsquared MAE
## 0.09784328 0.69226170 0.07327428
6.3.2 Gradient Boosting Machine
The optimal parameters are n.trees = 900, interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
The corresponding test-set RMSE and R2 are 0.1104675 and 0.5972602 respectively.
set.seed(0)
TreeBased_GBM <- train(x = data_train_X,
y = data_train_Y,
method = "gbm",
tuneGrid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = c(0.01, 0.1),
.n.minobsinnode = c(5,10)),
tuneLength = 10,
trControl = ctrl,
verbose = FALSE)
TreeBased_GBM
## Stochastic Gradient Boosting
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE
## 0.01 1 5 100 0.1535056
## 0.01 1 5 150 0.1490350
## 0.01 1 5 200 0.1458946
## 0.01 1 5 250 0.1436576
## 0.01 1 5 300 0.1419323
## 0.01 1 5 350 0.1406554
## 0.01 1 5 400 0.1396556
## 0.01 1 5 450 0.1389351
## 0.01 1 5 500 0.1382906
## 0.01 1 5 550 0.1378580
## 0.01 1 5 600 0.1374626
## 0.01 1 5 650 0.1371200
## 0.01 1 5 700 0.1367336
## 0.01 1 5 750 0.1364098
## 0.01 1 5 800 0.1361145
## 0.01 1 5 850 0.1358792
## 0.01 1 5 900 0.1356063
## 0.01 1 5 950 0.1353368
## 0.01 1 5 1000 0.1351189
## 0.01 1 10 100 0.1536465
## 0.01 1 10 150 0.1491078
## 0.01 1 10 200 0.1460110
## 0.01 1 10 250 0.1437385
## 0.01 1 10 300 0.1420114
## 0.01 1 10 350 0.1406584
## 0.01 1 10 400 0.1397182
## 0.01 1 10 450 0.1389405
## 0.01 1 10 500 0.1383112
## 0.01 1 10 550 0.1377765
## 0.01 1 10 600 0.1373695
## 0.01 1 10 650 0.1370119
## 0.01 1 10 700 0.1366433
## 0.01 1 10 750 0.1362744
## 0.01 1 10 800 0.1360030
## 0.01 1 10 850 0.1357687
## 0.01 1 10 900 0.1355482
## 0.01 1 10 950 0.1352849
## 0.01 1 10 1000 0.1350739
## 0.01 3 5 100 0.1443981
## 0.01 3 5 150 0.1387395
## 0.01 3 5 200 0.1352581
## 0.01 3 5 250 0.1330947
## 0.01 3 5 300 0.1316020
## 0.01 3 5 350 0.1304906
## 0.01 3 5 400 0.1295551
## 0.01 3 5 450 0.1287383
## 0.01 3 5 500 0.1280514
## 0.01 3 5 550 0.1274304
## 0.01 3 5 600 0.1267289
## 0.01 3 5 650 0.1262282
## 0.01 3 5 700 0.1258221
## 0.01 3 5 750 0.1254674
## 0.01 3 5 800 0.1251154
## 0.01 3 5 850 0.1247970
## 0.01 3 5 900 0.1245305
## 0.01 3 5 950 0.1243256
## 0.01 3 5 1000 0.1240812
## 0.01 3 10 100 0.1445042
## 0.01 3 10 150 0.1386595
## 0.01 3 10 200 0.1351232
## 0.01 3 10 250 0.1329691
## 0.01 3 10 300 0.1313775
## 0.01 3 10 350 0.1301674
## 0.01 3 10 400 0.1292257
## 0.01 3 10 450 0.1284388
## 0.01 3 10 500 0.1277175
## 0.01 3 10 550 0.1271278
## 0.01 3 10 600 0.1265867
## 0.01 3 10 650 0.1260797
## 0.01 3 10 700 0.1256559
## 0.01 3 10 750 0.1252243
## 0.01 3 10 800 0.1248902
## 0.01 3 10 850 0.1245717
## 0.01 3 10 900 0.1242475
## 0.01 3 10 950 0.1239413
## 0.01 3 10 1000 0.1237621
## 0.01 5 5 100 0.1407584
## 0.01 5 5 150 0.1347082
## 0.01 5 5 200 0.1310276
## 0.01 5 5 250 0.1286710
## 0.01 5 5 300 0.1269540
## 0.01 5 5 350 0.1257140
## 0.01 5 5 400 0.1246789
## 0.01 5 5 450 0.1239186
## 0.01 5 5 500 0.1233149
## 0.01 5 5 550 0.1226674
## 0.01 5 5 600 0.1220399
## 0.01 5 5 650 0.1216075
## 0.01 5 5 700 0.1212381
## 0.01 5 5 750 0.1208889
## 0.01 5 5 800 0.1205311
## 0.01 5 5 850 0.1201081
## 0.01 5 5 900 0.1198700
## 0.01 5 5 950 0.1196036
## 0.01 5 5 1000 0.1193918
## 0.01 5 10 100 0.1405693
## 0.01 5 10 150 0.1344117
## 0.01 5 10 200 0.1307994
## 0.01 5 10 250 0.1282400
## 0.01 5 10 300 0.1265838
## 0.01 5 10 350 0.1253331
## 0.01 5 10 400 0.1242838
## 0.01 5 10 450 0.1234494
## 0.01 5 10 500 0.1226329
## 0.01 5 10 550 0.1219982
## 0.01 5 10 600 0.1214775
## 0.01 5 10 650 0.1209410
## 0.01 5 10 700 0.1206051
## 0.01 5 10 750 0.1201725
## 0.01 5 10 800 0.1198850
## 0.01 5 10 850 0.1195760
## 0.01 5 10 900 0.1192327
## 0.01 5 10 950 0.1190077
## 0.01 5 10 1000 0.1188001
## 0.01 7 5 100 0.1384275
## 0.01 7 5 150 0.1320876
## 0.01 7 5 200 0.1282591
## 0.01 7 5 250 0.1257032
## 0.01 7 5 300 0.1239902
## 0.01 7 5 350 0.1225967
## 0.01 7 5 400 0.1216320
## 0.01 7 5 450 0.1206244
## 0.01 7 5 500 0.1199510
## 0.01 7 5 550 0.1193445
## 0.01 7 5 600 0.1187967
## 0.01 7 5 650 0.1184142
## 0.01 7 5 700 0.1180329
## 0.01 7 5 750 0.1176940
## 0.01 7 5 800 0.1173624
## 0.01 7 5 850 0.1170728
## 0.01 7 5 900 0.1167934
## 0.01 7 5 950 0.1165117
## 0.01 7 5 1000 0.1162684
## 0.01 7 10 100 0.1381935
## 0.01 7 10 150 0.1316462
## 0.01 7 10 200 0.1276732
## 0.01 7 10 250 0.1251146
## 0.01 7 10 300 0.1232905
## 0.01 7 10 350 0.1220781
## 0.01 7 10 400 0.1210549
## 0.01 7 10 450 0.1202724
## 0.01 7 10 500 0.1195638
## 0.01 7 10 550 0.1189483
## 0.01 7 10 600 0.1183244
## 0.01 7 10 650 0.1179829
## 0.01 7 10 700 0.1175881
## 0.01 7 10 750 0.1172371
## 0.01 7 10 800 0.1169624
## 0.01 7 10 850 0.1166893
## 0.01 7 10 900 0.1163924
## 0.01 7 10 950 0.1161454
## 0.01 7 10 1000 0.1159055
## 0.10 1 5 100 0.1353870
## 0.10 1 5 150 0.1340044
## 0.10 1 5 200 0.1331607
## 0.10 1 5 250 0.1325162
## 0.10 1 5 300 0.1322091
## 0.10 1 5 350 0.1322188
## 0.10 1 5 400 0.1319665
## 0.10 1 5 450 0.1318188
## 0.10 1 5 500 0.1319638
## 0.10 1 5 550 0.1320879
## 0.10 1 5 600 0.1321047
## 0.10 1 5 650 0.1319325
## 0.10 1 5 700 0.1323059
## 0.10 1 5 750 0.1323037
## 0.10 1 5 800 0.1322896
## 0.10 1 5 850 0.1326825
## 0.10 1 5 900 0.1326673
## 0.10 1 5 950 0.1326552
## 0.10 1 5 1000 0.1326691
## 0.10 1 10 100 0.1353808
## 0.10 1 10 150 0.1335157
## 0.10 1 10 200 0.1326322
## 0.10 1 10 250 0.1321384
## 0.10 1 10 300 0.1319427
## 0.10 1 10 350 0.1319482
## 0.10 1 10 400 0.1320047
## 0.10 1 10 450 0.1317027
## 0.10 1 10 500 0.1320438
## 0.10 1 10 550 0.1319677
## 0.10 1 10 600 0.1317611
## 0.10 1 10 650 0.1320805
## 0.10 1 10 700 0.1319790
## 0.10 1 10 750 0.1319191
## 0.10 1 10 800 0.1318749
## 0.10 1 10 850 0.1322117
## 0.10 1 10 900 0.1322141
## 0.10 1 10 950 0.1323673
## 0.10 1 10 1000 0.1324594
## 0.10 3 5 100 0.1247993
## 0.10 3 5 150 0.1239652
## 0.10 3 5 200 0.1232386
## 0.10 3 5 250 0.1225885
## 0.10 3 5 300 0.1222911
## 0.10 3 5 350 0.1220773
## 0.10 3 5 400 0.1220112
## 0.10 3 5 450 0.1215983
## 0.10 3 5 500 0.1211350
## 0.10 3 5 550 0.1213181
## 0.10 3 5 600 0.1209272
## 0.10 3 5 650 0.1206428
## 0.10 3 5 700 0.1205694
## 0.10 3 5 750 0.1204253
## 0.10 3 5 800 0.1203508
## 0.10 3 5 850 0.1201017
## 0.10 3 5 900 0.1202024
## 0.10 3 5 950 0.1203198
## 0.10 3 5 1000 0.1201014
## 0.10 3 10 100 0.1246780
## 0.10 3 10 150 0.1234524
## 0.10 3 10 200 0.1226146
## 0.10 3 10 250 0.1216866
## 0.10 3 10 300 0.1209524
## 0.10 3 10 350 0.1208231
## 0.10 3 10 400 0.1208755
## 0.10 3 10 450 0.1206491
## 0.10 3 10 500 0.1206228
## 0.10 3 10 550 0.1204732
## 0.10 3 10 600 0.1202193
## 0.10 3 10 650 0.1201130
## 0.10 3 10 700 0.1202734
## 0.10 3 10 750 0.1203011
## 0.10 3 10 800 0.1200271
## 0.10 3 10 850 0.1201333
## 0.10 3 10 900 0.1202030
## 0.10 3 10 950 0.1201546
## 0.10 3 10 1000 0.1202208
## 0.10 5 5 100 0.1212583
## 0.10 5 5 150 0.1197410
## 0.10 5 5 200 0.1192567
## 0.10 5 5 250 0.1183344
## 0.10 5 5 300 0.1177544
## 0.10 5 5 350 0.1172734
## 0.10 5 5 400 0.1170780
## 0.10 5 5 450 0.1170512
## 0.10 5 5 500 0.1170092
## 0.10 5 5 550 0.1169448
## 0.10 5 5 600 0.1168782
## 0.10 5 5 650 0.1170498
## 0.10 5 5 700 0.1169686
## 0.10 5 5 750 0.1169860
## 0.10 5 5 800 0.1168449
## 0.10 5 5 850 0.1166501
## 0.10 5 5 900 0.1166432
## 0.10 5 5 950 0.1166941
## 0.10 5 5 1000 0.1166490
## 0.10 5 10 100 0.1203285
## 0.10 5 10 150 0.1186437
## 0.10 5 10 200 0.1178004
## 0.10 5 10 250 0.1173107
## 0.10 5 10 300 0.1169347
## 0.10 5 10 350 0.1164742
## 0.10 5 10 400 0.1160615
## 0.10 5 10 450 0.1158243
## 0.10 5 10 500 0.1154760
## 0.10 5 10 550 0.1153967
## 0.10 5 10 600 0.1152440
## 0.10 5 10 650 0.1151973
## 0.10 5 10 700 0.1150863
## 0.10 5 10 750 0.1151455
## 0.10 5 10 800 0.1148742
## 0.10 5 10 850 0.1149837
## 0.10 5 10 900 0.1146944
## 0.10 5 10 950 0.1148337
## 0.10 5 10 1000 0.1148119
## 0.10 7 5 100 0.1184026
## 0.10 7 5 150 0.1179315
## 0.10 7 5 200 0.1180765
## 0.10 7 5 250 0.1179658
## 0.10 7 5 300 0.1175782
## 0.10 7 5 350 0.1170519
## 0.10 7 5 400 0.1168684
## 0.10 7 5 450 0.1165944
## 0.10 7 5 500 0.1164862
## 0.10 7 5 550 0.1165281
## 0.10 7 5 600 0.1164659
## 0.10 7 5 650 0.1164430
## 0.10 7 5 700 0.1162906
## 0.10 7 5 750 0.1163773
## 0.10 7 5 800 0.1164120
## 0.10 7 5 850 0.1164677
## 0.10 7 5 900 0.1165455
## 0.10 7 5 950 0.1165966
## 0.10 7 5 1000 0.1165314
## 0.10 7 10 100 0.1175349
## 0.10 7 10 150 0.1170537
## 0.10 7 10 200 0.1166296
## 0.10 7 10 250 0.1164940
## 0.10 7 10 300 0.1162592
## 0.10 7 10 350 0.1162896
## 0.10 7 10 400 0.1160903
## 0.10 7 10 450 0.1158844
## 0.10 7 10 500 0.1161168
## 0.10 7 10 550 0.1159057
## 0.10 7 10 600 0.1157356
## 0.10 7 10 650 0.1155581
## 0.10 7 10 700 0.1154481
## 0.10 7 10 750 0.1153436
## 0.10 7 10 800 0.1156930
## 0.10 7 10 850 0.1156195
## 0.10 7 10 900 0.1155034
## 0.10 7 10 950 0.1156759
## 0.10 7 10 1000 0.1157331
## Rsquared MAE
## 0.2923043 0.12104582
## 0.3175356 0.11713260
## 0.3345305 0.11446047
## 0.3457517 0.11262829
## 0.3539033 0.11130744
## 0.3601665 0.11020835
## 0.3651888 0.10940607
## 0.3693003 0.10887048
## 0.3730780 0.10838540
## 0.3758276 0.10801547
## 0.3780504 0.10770114
## 0.3807034 0.10741898
## 0.3834791 0.10710048
## 0.3859153 0.10683060
## 0.3881325 0.10657080
## 0.3898195 0.10634838
## 0.3920246 0.10610349
## 0.3939641 0.10585506
## 0.3953937 0.10564788
## 0.2901715 0.12113708
## 0.3173627 0.11721307
## 0.3324042 0.11449854
## 0.3442850 0.11266653
## 0.3530893 0.11133265
## 0.3596855 0.11022095
## 0.3647388 0.10947421
## 0.3687133 0.10884475
## 0.3724134 0.10833251
## 0.3762077 0.10791443
## 0.3788602 0.10762313
## 0.3814104 0.10728577
## 0.3841774 0.10699975
## 0.3866392 0.10666650
## 0.3886066 0.10641055
## 0.3902346 0.10622005
## 0.3918679 0.10600871
## 0.3938946 0.10576644
## 0.3955024 0.10553620
## 0.3895942 0.11361467
## 0.4085464 0.10883339
## 0.4221029 0.10593687
## 0.4322963 0.10407672
## 0.4401262 0.10275747
## 0.4464604 0.10176121
## 0.4524626 0.10089176
## 0.4575277 0.10012256
## 0.4619026 0.09946874
## 0.4662053 0.09887681
## 0.4714428 0.09827623
## 0.4750793 0.09783409
## 0.4776138 0.09744715
## 0.4799069 0.09709940
## 0.4823535 0.09674901
## 0.4844402 0.09642783
## 0.4861647 0.09615482
## 0.4873754 0.09593639
## 0.4890149 0.09567516
## 0.3877509 0.11364208
## 0.4095044 0.10877600
## 0.4228489 0.10593533
## 0.4328778 0.10405052
## 0.4421676 0.10273885
## 0.4491202 0.10165845
## 0.4548781 0.10077391
## 0.4600134 0.10005415
## 0.4650236 0.09932900
## 0.4689388 0.09871916
## 0.4725248 0.09816326
## 0.4759486 0.09764840
## 0.4788238 0.09718342
## 0.4817735 0.09675055
## 0.4839559 0.09636641
## 0.4862928 0.09609419
## 0.4885961 0.09572951
## 0.4907546 0.09543445
## 0.4919536 0.09523381
## 0.4296328 0.11056944
## 0.4465845 0.10542151
## 0.4595223 0.10223897
## 0.4698218 0.10009360
## 0.4794847 0.09860618
## 0.4862967 0.09744553
## 0.4927006 0.09649515
## 0.4973343 0.09570717
## 0.5007846 0.09510394
## 0.5049475 0.09443334
## 0.5090950 0.09384399
## 0.5117210 0.09335976
## 0.5139853 0.09298451
## 0.5162901 0.09264970
## 0.5187712 0.09233416
## 0.5216779 0.09195232
## 0.5232556 0.09171237
## 0.5249086 0.09144184
## 0.5262414 0.09120052
## 0.4305419 0.11040402
## 0.4496975 0.10519122
## 0.4622780 0.10210678
## 0.4745939 0.09982927
## 0.4829701 0.09830628
## 0.4898305 0.09716037
## 0.4960863 0.09617146
## 0.5011013 0.09532461
## 0.5062142 0.09453153
## 0.5101728 0.09390248
## 0.5133029 0.09335901
## 0.5171049 0.09284134
## 0.5192795 0.09246480
## 0.5219163 0.09200919
## 0.5236911 0.09174965
## 0.5257575 0.09142983
## 0.5281709 0.09110828
## 0.5296959 0.09084697
## 0.5309683 0.09059011
## 0.4553393 0.10873165
## 0.4715015 0.10322464
## 0.4840593 0.09986379
## 0.4955013 0.09761230
## 0.5037097 0.09601242
## 0.5112731 0.09465708
## 0.5169049 0.09363113
## 0.5232551 0.09268454
## 0.5271739 0.09195952
## 0.5309892 0.09133709
## 0.5346564 0.09079435
## 0.5369050 0.09041760
## 0.5390687 0.08998010
## 0.5410202 0.08963275
## 0.5431144 0.08930669
## 0.5450597 0.08901503
## 0.5467731 0.08875337
## 0.5487468 0.08850885
## 0.5503899 0.08830302
## 0.4579059 0.10834217
## 0.4765811 0.10278031
## 0.4897284 0.09926566
## 0.5007878 0.09692264
## 0.5099010 0.09521690
## 0.5154119 0.09405683
## 0.5212298 0.09301995
## 0.5255340 0.09224600
## 0.5296746 0.09152930
## 0.5334226 0.09091125
## 0.5373467 0.09030151
## 0.5391987 0.08993093
## 0.5415483 0.08946552
## 0.5437602 0.08909500
## 0.5455212 0.08878540
## 0.5473915 0.08851441
## 0.5493444 0.08819497
## 0.5510494 0.08794372
## 0.5526001 0.08768313
## 0.3925706 0.10578607
## 0.4009384 0.10441821
## 0.4073512 0.10356795
## 0.4121818 0.10286105
## 0.4147433 0.10252442
## 0.4148174 0.10231741
## 0.4170615 0.10200739
## 0.4186291 0.10180281
## 0.4173253 0.10176632
## 0.4168509 0.10178692
## 0.4171331 0.10169856
## 0.4191245 0.10159296
## 0.4156994 0.10191771
## 0.4163146 0.10185401
## 0.4170539 0.10192938
## 0.4138965 0.10206631
## 0.4142706 0.10204338
## 0.4145678 0.10190521
## 0.4147332 0.10185310
## 0.3921931 0.10586555
## 0.4072644 0.10407100
## 0.4136823 0.10302749
## 0.4161934 0.10222333
## 0.4176470 0.10183855
## 0.4179291 0.10164997
## 0.4172452 0.10159452
## 0.4196887 0.10133641
## 0.4169207 0.10150627
## 0.4182346 0.10139667
## 0.4198290 0.10107966
## 0.4171574 0.10112351
## 0.4181224 0.10117512
## 0.4190221 0.10095109
## 0.4197240 0.10083218
## 0.4169570 0.10094134
## 0.4164783 0.10101379
## 0.4157212 0.10097037
## 0.4151814 0.10098202
## 0.4816502 0.09612469
## 0.4868661 0.09486796
## 0.4913976 0.09386725
## 0.4968902 0.09321608
## 0.4999181 0.09285739
## 0.5016669 0.09260927
## 0.5026058 0.09225028
## 0.5067216 0.09199024
## 0.5100292 0.09154079
## 0.5092293 0.09162251
## 0.5125149 0.09122717
## 0.5148562 0.09089508
## 0.5157139 0.09088356
## 0.5173058 0.09063209
## 0.5182242 0.09069230
## 0.5202329 0.09060502
## 0.5195152 0.09066987
## 0.5188649 0.09053892
## 0.5206071 0.09044505
## 0.4819876 0.09626836
## 0.4904744 0.09454040
## 0.4967831 0.09377631
## 0.5045420 0.09303392
## 0.5104180 0.09212121
## 0.5116055 0.09181519
## 0.5111501 0.09173337
## 0.5131878 0.09165600
## 0.5138878 0.09147914
## 0.5156730 0.09137932
## 0.5175690 0.09106740
## 0.5185697 0.09105451
## 0.5177694 0.09111288
## 0.5178604 0.09125541
## 0.5197956 0.09083979
## 0.5190885 0.09096936
## 0.5189439 0.09100255
## 0.5192169 0.09083851
## 0.5190606 0.09082384
## 0.5082410 0.09219655
## 0.5199920 0.09033539
## 0.5226496 0.08951928
## 0.5302999 0.08894231
## 0.5350123 0.08839156
## 0.5388060 0.08805741
## 0.5407253 0.08783215
## 0.5409549 0.08772513
## 0.5418992 0.08763445
## 0.5425453 0.08759442
## 0.5428591 0.08757097
## 0.5418499 0.08771346
## 0.5427772 0.08765661
## 0.5428787 0.08758450
## 0.5443327 0.08749978
## 0.5455025 0.08728405
## 0.5460617 0.08722536
## 0.5459019 0.08738157
## 0.5462553 0.08736333
## 0.5157180 0.09190604
## 0.5283063 0.09030123
## 0.5342711 0.08958764
## 0.5382390 0.08888796
## 0.5414564 0.08830597
## 0.5452703 0.08796681
## 0.5488859 0.08731427
## 0.5513184 0.08694026
## 0.5541820 0.08661303
## 0.5551141 0.08667164
## 0.5565919 0.08631202
## 0.5569887 0.08610650
## 0.5580502 0.08590139
## 0.5579847 0.08578173
## 0.5601827 0.08565956
## 0.5593102 0.08578147
## 0.5617171 0.08551281
## 0.5610253 0.08569260
## 0.5614420 0.08571081
## 0.5312223 0.08936245
## 0.5350188 0.08859739
## 0.5341844 0.08839159
## 0.5354758 0.08786617
## 0.5386357 0.08724049
## 0.5429063 0.08694732
## 0.5444806 0.08656257
## 0.5469085 0.08651211
## 0.5481268 0.08621271
## 0.5481705 0.08636112
## 0.5488321 0.08636786
## 0.5491554 0.08641019
## 0.5504592 0.08633135
## 0.5500163 0.08634585
## 0.5499226 0.08638559
## 0.5497237 0.08636256
## 0.5493203 0.08638006
## 0.5490384 0.08638911
## 0.5496591 0.08631969
## 0.5379357 0.08898425
## 0.5416412 0.08834651
## 0.5448512 0.08767391
## 0.5458605 0.08698501
## 0.5480120 0.08693774
## 0.5481531 0.08704799
## 0.5502196 0.08686922
## 0.5522713 0.08686386
## 0.5508473 0.08707606
## 0.5526465 0.08682283
## 0.5542019 0.08661854
## 0.5556558 0.08641946
## 0.5565818 0.08623262
## 0.5573549 0.08605023
## 0.5550600 0.08633598
## 0.5557930 0.08638667
## 0.5567567 0.08629577
## 0.5556182 0.08635852
## 0.5552863 0.08638720
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 900,
## interaction.depth = 5, shrinkage = 0.1 and n.minobsinnode = 10.
TreeBased_GBM_Pred <- predict(TreeBased_GBM, newdata = data_test_X)
TreeBased_GBM_metrics <- postResample(pred = TreeBased_GBM_Pred, obs = data_test_Y)
TreeBased_GBM_metrics
## RMSE Rsquared MAE
## 0.1104675 0.5972602 0.0845282
6.3.3 Cubist
The optimal parameters are committees = 20 and neighbors = 5.
The corresponding test-set RMSE and R2 are 0.09987318 and 0.67114775 respectively.
set.seed(0)
TreeBased_Cubist <- train(x = data_train_X,
y = data_train_Y,
method = "cubist",
trControl = trainControl(method = 'cv'))
TreeBased_Cubist
## Cubist
##
## 2054 samples
## 28 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1849, 1849, 1849, 1849, 1849, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1286602 0.4755080 0.08985362
## 1 5 0.1239531 0.5287986 0.08520446
## 1 9 0.1236001 0.5245257 0.08516113
## 10 0 0.1117588 0.5832072 0.08091727
## 10 5 0.1054796 0.6275221 0.07473292
## 10 9 0.1054210 0.6270252 0.07520324
## 20 0 0.1107426 0.5919454 0.08024516
## 20 5 0.1042786 0.6350231 0.07382106
## 20 9 0.1043447 0.6342982 0.07434306
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
TreeBased_Cubist_Pred <- predict(TreeBased_Cubist, newdata = data_test_X)
TreeBased_Cubist_metrics <- postResample(pred = TreeBased_Cubist_Pred, obs = data_test_Y)
TreeBased_Cubist_metrics
## RMSE Rsquared MAE
## 0.09987318 0.67114775 0.07325504
7 Model Selection
The SVM-Radial model has both the lowest RMSE and the highest R2 in the comparison below, and is therefore selected as the final model.
rbind(Linear_PLS_metrics,
Linear_Ridge_metrics,
Linear_LASSO_metrics,
Linear_eNet_metrics,
NonLinear_KNN_metrics,
NonLinear_SVMLinear_metrics,
NonLinear_SVMRadial_metrics,
NonLinear_MARS_metrics,
NonLinear_NNet_metrics,
TreeBased_RF_metrics,
TreeBased_GBM_metrics,
TreeBased_Cubist_metrics
) %>%
data.frame() %>%
arrange(RMSE)
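Finally, to produce the Excel prediction file requested in the introduction, a minimal sketch is shown below. It assumes the writexl package is available and that the selected SVM-Radial model and the preprocessed evaluation predictors from Section 5 are still in the session; the output file name is illustrative.
# Predict PH for the evaluation set with the selected SVM-Radial model
eval_PH_pred <- predict(NonLinear_SVMRadial, newdata = data_eval_X)
# Attach the predictions to the original evaluation data and export to Excel
df_eval %>%
  mutate(PH = eval_PH_pred) %>%
  writexl::write_xlsx("StudentEvaluation_PH_predictions.xlsx")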