DATA624 Project 2

Project Requirements

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Exploratory Data Analysis

A summary of the student data set (StudentData.xlsx) is shown below:

#Load library
library(fpp3)
library(ggplot2)
library(feasts)
library(readxl)
library(openxlsx)
library(imputeTS)
library(caret)
library(corrplot)
library(e1071)

#Read data
beverage<-read_xlsx('StudentData.xlsx')

#List the columns that contain NAs, most incomplete first
beverage|>
  skimr::skim()|>
  filter(n_missing>0)|>
  arrange(desc(n_missing))

Data summary
Name	beverage
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	30
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand Code	120	0.95	1	1	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Filler Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
PC Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
PSC CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Fill Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
Carb Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Hyd Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Carb Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
Fill Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Filler Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Hyd Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Oxygen Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Pressure Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Hyd Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Carb Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Carb Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Alch Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Usage cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Mnf Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Bowl Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Balling Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆


There are 32 predictors and 1 response (PH). All predictors are numeric except Brand Code, which is a character variable.

Additional findings on the data:
1. Missing values occur in 31 of the 33 columns, but no column is missing more than about 8% of its values (MFR, with 212 of 2571 rows, has the most).
2. Only a small number of predictors are normally distributed.

Imputing missing values with KNN via preProcess

#Impute the numeric columns with KNN (k=5); knnImpute also centers and scales them
preProcValues<-preProcess(as.data.frame(beverage), method="knnImpute", k=5, knnSummary=mean)
impute_beverage<-predict(preProcValues, beverage, na.action=na.pass)
#Reverse the centering and scaling to return each column to its original units
procNames <- data.frame(col = names(preProcValues$mean), mean = preProcValues$mean, sd = preProcValues$std)
for(i in procNames$col){
  impute_beverage[i] <- impute_beverage[i]*preProcValues$std[i]+preProcValues$mean[i]
}

library(imputeMissings)
#Fill the remaining NAs in Brand Code with the mode and save the result as a new object
impute_beverage_c <- impute(impute_beverage, method = "median/mode")

#Check that no NAs remain

skimr::skim(impute_beverage_c)

Data summary
Name	impute_beverage_c
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand.Code	0	1	1	1	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	1	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill.Ounces	1	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	1	68.21	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	1	141.12	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	1	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	1	24.47	119.49	-100.20	-100.00	64.80	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	1	122.54	4.74	105.60	118.80	123.20	125.40	140.20	▁▅▇▂▁
Fill.Pressure	1	47.93	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure1	1	12.46	12.42	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd.Pressure2	1	20.96	16.37	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd.Pressure3	1	20.43	15.95	-1.20	0.00	27.40	33.20	50.00	▇▁▃▇▁
Hyd.Pressure4	1	96.37	13.09	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler.Level	1	109.24	15.68	55.80	98.40	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	1	3681.24	767.26	998.00	3866.80	3980.00	3998.00	4030.00	▁▁▁▁▇
Temperature	1	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	1	20.99	2.98	12.08	18.36	21.80	23.76	25.90	▁▃▅▃▇
Carb.Flow	1	2467.97	1073.38	26.00	1151.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	1	690.08	92.36	31.40	697.60	722.20	730.40	868.60	▁▁▁▂▇
Balling	1	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure.Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	1	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl.Setpoint	1	109.30	15.32	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure.Setpoint	1	47.61	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	1	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch.Rel	1	6.90	0.50	5.28	6.54	6.56	7.23	8.62	▁▇▂▃▁
Carb.Rel	1	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▆▂▁
Balling.Lvl	1	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

The output of knnImpute is centered and scaled, so impute_beverage is de-normalized afterwards to return the data to its original units. However, the preProcess function in the caret package only handles numeric variables, so the impute function from the imputeMissings package is used to fill the remaining missing values in Brand Code with its most frequent value. Brand Code has 120 missing values, which is less than 5% of the data, so imputing it this way should be acceptable.
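The two steps described above (undoing the centering and scaling, then filling a categorical column with its mode) can be sketched in base R. This is only an illustration on made-up values; the data frame `toy` and its columns are hypothetical and are not part of the beverage data.

```r
# Toy data standing in for the beverage table (values are made up).
toy <- data.frame(
  brand = c("A", "B", NA, "B", "B"),
  temp  = c(65.2, 66.4, 65.6, 67.0, 65.8),
  stringsAsFactors = FALSE
)

# Standardize a numeric column, then reverse the transformation.
mu <- mean(toy$temp)
sigma <- sd(toy$temp)
z <- (toy$temp - mu) / sigma      # what centering and scaling produces
restored <- z * sigma + mu        # back on the original scale

# Replace missing categories with the most frequent observed level.
mode_value <- names(which.max(table(toy$brand)))
toy$brand[is.na(toy$brand)] <- mode_value

all.equal(restored, toy$temp)     # checks the round trip
toy$brand
```

Multiplying by the standard deviation and adding the mean is the exact inverse of standardization, which is why the loop over procNames recovers the original units.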

The data is then reviewed for correlation among the numeric variables.

td<-impute_beverage_c[,-1]

cor_res <- cor(td, use = "na.or.complete")
corrplot(cor_res, 
        type = "lower",
        order = "original",
        tl.col = "Blue",
        tl.srt = 45,
        tl.cex = 0.5
)

#Filter out high correlation variables
highCorr <- findCorrelation(cor_res, cutoff = .75)
length(highCorr)

## [1] 10

filtered_impute_beverage_c <- impute_beverage_c[, -highCorr]
skimr::skim(filtered_impute_beverage_c)

Data summary
Name	filtered_impute_beverage_…
Number of rows	2571
Number of columns	23
_______________________
Column type frequency:
character	1
numeric	22
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand.Code	0	1	1	1	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	1	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill.Ounces	1	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Temp	1	141.12	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	1	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
Mnf.Flow	1	24.47	119.49	-100.20	-100.00	64.80	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	1	122.54	4.74	105.60	118.80	123.20	125.40	140.20	▁▅▇▂▁
Fill.Pressure	1	47.93	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure1	1	12.46	12.42	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd.Pressure3	1	20.43	15.95	-1.20	0.00	27.40	33.20	50.00	▇▁▃▇▁
Filler.Speed	1	3681.24	767.26	998.00	3866.80	3980.00	3998.00	4030.00	▁▁▁▁▇
Temperature	1	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	1	20.99	2.98	12.08	18.36	21.80	23.76	25.90	▁▃▅▃▇
Density	1	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
Balling	1	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure.Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	1	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl.Setpoint	1	109.30	15.32	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure.Setpoint	1	47.61	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Balling.Lvl	1	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

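Ten columns are removed by the correlation filter. Note that highCorr holds column positions in td, which excludes Brand.Code, whereas the removal is applied to impute_beverage_c, which has Brand.Code as its first column. The columns actually dropped are therefore shifted by one position relative to those flagged (for example, MFR is dropped at the position where Balling was flagged), so Density, Balling and Balling.Lvl all remain in the filtered set. The models below are fitted on the filtered set exactly as shown above.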
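Removing columns by name rather than by position avoids any dependence on how the columns are ordered. The following is a minimal sketch on made-up data, not the report's code; `toy`, `flagged_idx` and the column names are hypothetical, and the flagged index is simply assumed rather than computed by a correlation filter.

```r
# Toy frame: one label column followed by three numeric columns (made-up values).
toy <- data.frame(
  label = c("A", "B", "C", "D", "A", "B"),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.1),  # nearly 2 * x1
  x3 = c(5, 1, 4, 2, 6, 3)
)

numeric_part <- toy[, -1]
cor_mat <- cor(numeric_part)
cor_mat["x1", "x2"]               # x1 and x2 are strongly correlated

# Assume a filter flags x2, which is column 2 of numeric_part.
flagged_idx <- 2
flagged_names <- names(numeric_part)[flagged_idx]

# Positional removal on the full frame drops a different column (x1)...
by_position <- toy[, -flagged_idx]
# ...whereas removal by name drops the flagged column (x2).
by_name <- toy[, !(names(toy) %in% flagged_names)]

names(by_position)
names(by_name)
```

Translating the indices to names immediately after they are computed keeps the selection valid even if columns are later added, dropped or reordered.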

Check normality of all data

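#Plot a histogram of every numeric column (Brand.Code is excluded so the values stay numeric)
filtered_impute_beverage_c|>
  select(where(is.numeric))|>
  pivot_longer(everything(), names_to='predictor', values_to='value')|>
  ggplot(aes(x=value))+
  geom_histogram(bins=30)+
  facet_wrap(~predictor, scales='free')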

The distribution of PH is close to normal.
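A numeric summary such as skewness can complement the visual check. Below is a minimal base R sketch of a moment-based skewness; `skewness_moment` is a hypothetical helper and the two vectors are made-up values, not the beverage data. Values near 0 suggest a symmetric distribution, and positive values indicate a longer right tail.

```r
# Moment-based sample skewness; hypothetical helper, not from the report.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

symmetric <- c(8.3, 8.4, 8.5, 8.6, 8.7)   # made-up, symmetric around 8.5
right_skewed <- c(1, 1, 1, 2, 2, 3, 10)   # made-up, long right tail

skewness_moment(symmetric)      # essentially zero
skewness_moment(right_skewed)   # clearly positive
```

Other skewness estimators apply small-sample corrections, so packaged functions may return slightly different numbers for the same data.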

Take a first look at the distribution of the response PH before splitting the data, then split the data into a training set and a test set at a ratio of 80-20.

set.seed(2744)

#Remove three outliers (rows 1094, 1719 and 2359) identified from the OLS results

rout<-c(1094,1719,2359)
filtered_impute_beverage_c<-filtered_impute_beverage_c[-rout,]

fold <- filtered_impute_beverage_c$PH %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

#Create training and testing sets (column 19 is the response PH)
beverage_train<-filtered_impute_beverage_c[fold,-19]
beverage_test<-filtered_impute_beverage_c[-fold,-19]
PH_train<-filtered_impute_beverage_c[fold,19]
PH_test<-filtered_impute_beverage_c[-fold,19]
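For a numeric response, a stratified split samples within percentile-based groups so that the training and test sets have similar distributions of the response. A minimal base R sketch of that idea follows; `stratified_split` is a hypothetical helper, the use of five groups is an assumption, and the simulated `y` merely stands in for PH.

```r
# Hypothetical helper: stratified split on a numeric response.
stratified_split <- function(y, p = 0.8, groups = 5) {
  # Bin y into quantile-based groups (duplicate break points are dropped).
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a fraction p of the row indices inside every bin.
  idx <- unlist(lapply(split(seq_along(y), bins), function(rows) {
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  }))
  sort(unname(idx))
}

set.seed(1)
y <- rnorm(500, mean = 8.5, sd = 0.17)   # simulated stand-in for PH
train_idx <- stratified_split(y)

length(train_idx) / length(y)            # close to 0.8
summary(y[train_idx])
summary(y[-train_idx])
```

Comparing the two summaries shows whether the training and test responses cover a similar range, which is the point of stratifying.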

Machine Learning Models

Start with OLS

#Try the traditional linear regression


yb<-as.data.frame(cbind(PH_train, beverage_train))
lmod<-lm(PH_train~., yb)

summary(lmod)

## 
## Call:
## lm(formula = PH_train ~ ., data = yb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52590 -0.07900  0.01000  0.08871  0.39979 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.183e+01  8.821e-01  13.413  < 2e-16 ***
## Brand.CodeB        8.741e-02  2.372e-02   3.685 0.000234 ***
## Brand.CodeC       -5.461e-02  2.386e-02  -2.289 0.022210 *  
## Brand.CodeD        8.914e-02  1.214e-02   7.344 2.98e-13 ***
## Carb.Volume       -4.560e-02  5.024e-02  -0.908 0.364177    
## Fill.Ounces       -1.122e-01  3.517e-02  -3.189 0.001450 ** 
## PC.Volume         -3.786e-02  5.793e-02  -0.654 0.513490    
## Carb.Temp          7.290e-04  7.390e-04   0.986 0.324028    
## PSC               -1.272e-01  6.369e-02  -1.996 0.046024 *  
## PSC.Fill          -4.044e-02  2.574e-02  -1.571 0.116359    
## Mnf.Flow          -7.718e-04  4.990e-05 -15.467  < 2e-16 ***
## Carb.Pressure1     5.836e-03  7.423e-04   7.862 6.09e-15 ***
## Fill.Pressure      2.309e-03  1.292e-03   1.787 0.074070 .  
## Hyd.Pressure1     -1.265e-04  3.592e-04  -0.352 0.724824    
## Hyd.Pressure3      2.796e-03  4.352e-04   6.425 1.64e-10 ***
## Filler.Speed       5.210e-06  4.939e-06   1.055 0.291678    
## Temperature       -1.675e-02  2.500e-03  -6.699 2.70e-11 ***
## Usage.cont        -8.332e-03  1.206e-03  -6.907 6.60e-12 ***
## Density           -1.042e-01  2.946e-02  -3.535 0.000416 ***
## Balling           -1.093e-01  2.487e-02  -4.393 1.17e-05 ***
## Pressure.Vacuum   -1.659e-02  8.026e-03  -2.067 0.038845 *  
## Oxygen.Filler     -4.096e-01  8.052e-02  -5.087 3.97e-07 ***
## Bowl.Setpoint      1.817e-03  2.759e-04   6.586 5.75e-11 ***
## Pressure.Setpoint -7.249e-03  2.090e-03  -3.469 0.000533 ***
## Balling.Lvl        1.725e-01  2.487e-02   6.936 5.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1318 on 2031 degrees of freedom
## Multiple R-squared:  0.4101, Adjusted R-squared:  0.4031 
## F-statistic: 58.83 on 24 and 2031 DF,  p-value: < 2.2e-16

#There are many predictors, so run stepwise selection (by AIC) to reduce them
step(lmod)

## Start:  AIC=-8309.07
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + PC.Volume + 
##     Carb.Temp + PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 + 
##     Fill.Pressure + Hyd.Pressure1 + Hyd.Pressure3 + Filler.Speed + 
##     Temperature + Usage.cont + Density + Balling + Pressure.Vacuum + 
##     Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## - Hyd.Pressure1      1    0.0022 35.265 -8310.9
## - PC.Volume          1    0.0074 35.270 -8310.6
## - Carb.Volume        1    0.0143 35.277 -8310.2
## - Carb.Temp          1    0.0169 35.279 -8310.1
## - Filler.Speed       1    0.0193 35.282 -8309.9
## <none>                           35.263 -8309.1
## - PSC.Fill           1    0.0428 35.305 -8308.6
## - Fill.Pressure      1    0.0555 35.318 -8307.8
## - PSC                1    0.0692 35.332 -8307.0
## - Pressure.Vacuum    1    0.0742 35.337 -8306.7
## - Fill.Ounces        1    0.1766 35.439 -8300.8
## - Pressure.Setpoint  1    0.2089 35.472 -8298.9
## - Density            1    0.2170 35.480 -8298.5
## - Balling            1    0.3351 35.598 -8291.6
## - Oxygen.Filler      1    0.4493 35.712 -8285.0
## - Hyd.Pressure3      1    0.7168 35.979 -8269.7
## - Bowl.Setpoint      1    0.7530 36.016 -8267.6
## - Temperature        1    0.7792 36.042 -8266.1
## - Usage.cont         1    0.8283 36.091 -8263.3
## - Balling.Lvl        1    0.8352 36.098 -8262.9
## - Carb.Pressure1     1    1.0731 36.336 -8249.4
## - Mnf.Flow           1    4.1533 39.416 -8082.1
## - Brand.Code         3    4.4929 39.755 -8068.5
## 
## Step:  AIC=-8310.94
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + PC.Volume + 
##     Carb.Temp + PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 + 
##     Fill.Pressure + Hyd.Pressure3 + Filler.Speed + Temperature + 
##     Usage.cont + Density + Balling + Pressure.Vacuum + Oxygen.Filler + 
##     Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## - PC.Volume          1    0.0095 35.274 -8312.4
## - Carb.Volume        1    0.0138 35.279 -8312.1
## - Carb.Temp          1    0.0177 35.282 -8311.9
## - Filler.Speed       1    0.0214 35.286 -8311.7
## <none>                           35.265 -8310.9
## - PSC.Fill           1    0.0429 35.308 -8310.4
## - Fill.Pressure      1    0.0591 35.324 -8309.5
## - PSC                1    0.0683 35.333 -8309.0
## - Pressure.Vacuum    1    0.0775 35.342 -8308.4
## - Fill.Ounces        1    0.1764 35.441 -8302.7
## - Pressure.Setpoint  1    0.2106 35.475 -8300.7
## - Density            1    0.2254 35.490 -8299.8
## - Balling            1    0.3332 35.598 -8293.6
## - Oxygen.Filler      1    0.4479 35.713 -8287.0
## - Temperature        1    0.7771 36.042 -8268.1
## - Bowl.Setpoint      1    0.7988 36.064 -8266.9
## - Usage.cont         1    0.8302 36.095 -8265.1
## - Balling.Lvl        1    0.8330 36.098 -8264.9
## - Hyd.Pressure3      1    1.0698 36.335 -8251.5
## - Carb.Pressure1     1    1.1133 36.378 -8249.0
## - Mnf.Flow           1    4.1519 39.417 -8084.1
## - Brand.Code         3    4.4947 39.759 -8070.3
## 
## Step:  AIC=-8312.39
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + Carb.Temp + 
##     PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + 
##     Hyd.Pressure3 + Filler.Speed + Temperature + Usage.cont + 
##     Density + Balling + Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint + 
##     Pressure.Setpoint + Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## - Carb.Volume        1    0.0126 35.287 -8313.7
## - Carb.Temp          1    0.0177 35.292 -8313.4
## - Filler.Speed       1    0.0216 35.296 -8313.1
## <none>                           35.274 -8312.4
## - PSC.Fill           1    0.0397 35.314 -8312.1
## - Fill.Pressure      1    0.0635 35.338 -8310.7
## - Pressure.Vacuum    1    0.0716 35.346 -8310.2
## - PSC                1    0.0881 35.362 -8309.3
## - Fill.Ounces        1    0.1693 35.444 -8304.5
## - Pressure.Setpoint  1    0.2223 35.497 -8301.5
## - Density            1    0.2384 35.513 -8300.5
## - Balling            1    0.3290 35.603 -8295.3
## - Oxygen.Filler      1    0.4595 35.734 -8287.8
## - Temperature        1    0.7833 36.058 -8269.2
## - Bowl.Setpoint      1    0.7970 36.071 -8268.5
## - Usage.cont         1    0.8280 36.102 -8266.7
## - Balling.Lvl        1    0.8427 36.117 -8265.8
## - Hyd.Pressure3      1    1.0608 36.335 -8253.5
## - Carb.Pressure1     1    1.1732 36.447 -8247.1
## - Mnf.Flow           1    4.1455 39.420 -8085.9
## - Brand.Code         3    4.5148 39.789 -8070.8
## 
## Step:  AIC=-8313.65
## PH_train ~ Brand.Code + Fill.Ounces + Carb.Temp + PSC + PSC.Fill + 
##     Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + 
##     Filler.Speed + Temperature + Usage.cont + Density + Balling + 
##     Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + 
##     Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## - Filler.Speed       1    0.0202 35.307 -8314.5
## - Carb.Temp          1    0.0270 35.314 -8314.1
## <none>                           35.287 -8313.7
## - PSC.Fill           1    0.0390 35.326 -8313.4
## - Fill.Pressure      1    0.0627 35.350 -8312.0
## - Pressure.Vacuum    1    0.0691 35.356 -8311.6
## - PSC                1    0.0884 35.375 -8310.5
## - Fill.Ounces        1    0.1899 35.477 -8304.6
## - Pressure.Setpoint  1    0.2282 35.515 -8302.4
## - Density            1    0.2478 35.535 -8301.3
## - Balling            1    0.3264 35.613 -8296.7
## - Oxygen.Filler      1    0.4568 35.744 -8289.2
## - Temperature        1    0.7829 36.070 -8270.5
## - Bowl.Setpoint      1    0.7978 36.085 -8269.7
## - Balling.Lvl        1    0.8366 36.123 -8267.5
## - Usage.cont         1    0.8403 36.127 -8267.3
## - Hyd.Pressure3      1    1.0877 36.375 -8253.2
## - Carb.Pressure1     1    1.1673 36.454 -8248.7
## - Mnf.Flow           1    4.1793 39.466 -8085.5
## - Brand.Code         3    4.6334 39.920 -8066.0
## 
## Step:  AIC=-8314.48
## PH_train ~ Brand.Code + Fill.Ounces + Carb.Temp + PSC + PSC.Fill + 
##     Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + 
##     Temperature + Usage.cont + Density + Balling + Pressure.Vacuum + 
##     Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## - Carb.Temp          1    0.0301 35.337 -8314.7
## <none>                           35.307 -8314.5
## - PSC.Fill           1    0.0391 35.346 -8314.2
## - Pressure.Vacuum    1    0.0535 35.361 -8313.4
## - Fill.Pressure      1    0.0583 35.365 -8313.1
## - PSC                1    0.0875 35.395 -8311.4
## - Fill.Ounces        1    0.1872 35.494 -8305.6
## - Density            1    0.2370 35.544 -8302.7
## - Pressure.Setpoint  1    0.2398 35.547 -8302.6
## - Balling            1    0.3063 35.613 -8298.7
## - Oxygen.Filler      1    0.4601 35.767 -8289.9
## - Bowl.Setpoint      1    0.8028 36.110 -8270.3
## - Balling.Lvl        1    0.8246 36.132 -8269.0
## - Usage.cont         1    0.8365 36.144 -8268.3
## - Temperature        1    0.8835 36.191 -8265.7
## - Carb.Pressure1     1    1.1626 36.470 -8249.9
## - Hyd.Pressure3      1    1.2558 36.563 -8244.6
## - Mnf.Flow           1    4.1714 39.478 -8086.9
## - Brand.Code         3    4.6178 39.925 -8067.8
## 
## Step:  AIC=-8314.73
## PH_train ~ Brand.Code + Fill.Ounces + PSC + PSC.Fill + Mnf.Flow + 
##     Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature + 
##     Usage.cont + Density + Balling + Pressure.Vacuum + Oxygen.Filler + 
##     Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
## 
##                     Df Sum of Sq    RSS     AIC
## <none>                           35.337 -8314.7
## - PSC.Fill           1    0.0400 35.377 -8314.4
## - Pressure.Vacuum    1    0.0520 35.389 -8313.7
## - Fill.Pressure      1    0.0566 35.394 -8313.4
## - PSC                1    0.0911 35.428 -8311.4
## - Fill.Ounces        1    0.1869 35.524 -8305.9
## - Density            1    0.2309 35.568 -8303.3
## - Pressure.Setpoint  1    0.2441 35.581 -8302.6
## - Balling            1    0.3084 35.646 -8298.9
## - Oxygen.Filler      1    0.4644 35.802 -8289.9
## - Bowl.Setpoint      1    0.7958 36.133 -8270.9
## - Balling.Lvl        1    0.8243 36.161 -8269.3
## - Usage.cont         1    0.8357 36.173 -8268.7
## - Temperature        1    0.8720 36.209 -8266.6
## - Carb.Pressure1     1    1.1577 36.495 -8250.4
## - Hyd.Pressure3      1    1.2744 36.612 -8243.9
## - Mnf.Flow           1    4.1877 39.525 -8086.5
## - Brand.Code         3    4.6267 39.964 -8067.8

## 
## Call:
## lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + PSC.Fill + 
##     Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + 
##     Temperature + Usage.cont + Density + Balling + Pressure.Vacuum + 
##     Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl, 
##     data = yb)
## 
## Coefficients:
##       (Intercept)        Brand.CodeB        Brand.CodeC        Brand.CodeD  
##        11.8027377          0.0947439         -0.0465888          0.0840951  
##       Fill.Ounces                PSC           PSC.Fill           Mnf.Flow  
##        -0.1130416         -0.1407206         -0.0389223         -0.0007735  
##    Carb.Pressure1      Fill.Pressure      Hyd.Pressure3        Temperature  
##         0.0057393          0.0023010          0.0028126         -0.0172319  
##        Usage.cont            Density            Balling    Pressure.Vacuum  
##        -0.0081892         -0.1052398         -0.1005178         -0.0130724  
##     Oxygen.Filler      Bowl.Setpoint  Pressure.Setpoint        Balling.Lvl  
##        -0.4150076          0.0017462         -0.0077552          0.1654405

#Refit the stepwise model, additionally leaving out PSC.Fill and Pressure.Vacuum
lmod_step<-lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature + Usage.cont + Density + Balling + Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl, data = yb)

summary(lmod_step)

## 
## Call:
## lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + Mnf.Flow + 
##     Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature + 
##     Usage.cont + Density + Balling + Oxygen.Filler + Bowl.Setpoint + 
##     Pressure.Setpoint + Balling.Lvl, data = yb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52140 -0.07912  0.01137  0.08991  0.39263 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.189e+01  8.516e-01  13.963  < 2e-16 ***
## Brand.CodeB        9.529e-02  2.324e-02   4.101 4.28e-05 ***
## Brand.CodeC       -4.542e-02  2.319e-02  -1.959 0.050284 .  
## Brand.CodeD        8.306e-02  1.124e-02   7.389 2.15e-13 ***
## Fill.Ounces       -1.173e-01  3.442e-02  -3.409 0.000665 ***
## PSC               -1.622e-01  6.055e-02  -2.678 0.007459 ** 
## Mnf.Flow          -7.685e-04  4.975e-05 -15.446  < 2e-16 ***
## Carb.Pressure1     5.890e-03  6.974e-04   8.446  < 2e-16 ***
## Fill.Pressure      2.272e-03  1.275e-03   1.782 0.074957 .  
## Hyd.Pressure3      2.971e-03  3.186e-04   9.325  < 2e-16 ***
## Temperature       -1.647e-02  2.390e-03  -6.891 7.36e-12 ***
## Usage.cont        -8.143e-03  1.181e-03  -6.895 7.15e-12 ***
## Density           -1.092e-01  2.879e-02  -3.794 0.000152 ***
## Balling           -8.123e-02  2.104e-02  -3.860 0.000117 ***
## Oxygen.Filler     -3.952e-01  7.981e-02  -4.952 7.95e-07 ***
## Bowl.Setpoint      1.718e-03  2.578e-04   6.665 3.40e-11 ***
## Pressure.Setpoint -7.656e-03  2.068e-03  -3.702 0.000220 ***
## Balling.Lvl        1.479e-01  2.180e-02   6.786 1.51e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1318 on 2038 degrees of freedom
## Multiple R-squared:  0.4073, Adjusted R-squared:  0.4024 
## F-statistic: 82.38 on 17 and 2038 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lmod_step)

yb_step<-yb|>
  select(Brand.Code, Fill.Ounces, PSC, Mnf.Flow, Carb.Pressure1, Fill.Pressure, Hyd.Pressure3, Temperature, Usage.cont, Density, Balling, Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, Balling.Lvl)

OLS_predict<-lmod_step|>predict(beverage_test)

OLS_per<-postResample(pred = OLS_predict, obs = PH_test)

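With an adjusted R-squared of about 0.40, the model explains less than half of the variance in PH on the training data, which is too low for a good model.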
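Each model in this report is scored on the test set with postResample, which returns RMSE, R-squared and MAE. The sketch below shows how those three numbers can be computed by hand in base R. It assumes R-squared is taken as the squared correlation between predictions and observations; `regression_metrics` is a hypothetical helper and the values are made up.

```r
# Hypothetical helper mirroring the three metrics reported for each model.
regression_metrics <- function(pred, obs) {
  c(
    RMSE = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,         # squared correlation
    MAE = mean(abs(pred - obs))          # mean absolute error
  )
}

obs  <- c(8.40, 8.55, 8.60, 8.70, 8.50)   # made-up pH values
pred <- c(8.45, 8.50, 8.65, 8.60, 8.55)   # made-up predictions

metrics <- regression_metrics(pred, obs)
metrics
```

RMSE and MAE are in pH units, so they can be read directly against the spread of PH (standard deviation of about 0.17 in this data set).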

Lasso (alpha = 1), ridge (alpha = 0) and elastic net (any alpha in between)
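The role of alpha can be illustrated by writing out the elastic net penalty, lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)), as a small base R function. This is a sketch for illustration only; `elastic_net_penalty` is a hypothetical helper and the coefficients are made up.

```r
# Elastic net penalty for a coefficient vector (intercept excluded).
# alpha = 1 gives the lasso (L1) penalty, alpha = 0 the ridge (L2) penalty.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1, 0, 2)   # made-up coefficients

lasso_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # L1 only
ridge_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # L2 only
mixed_pen <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # blend

c(lasso = lasso_pen, ridge = ridge_pen, mixed = mixed_pen)
```

At alpha = 0.5 the penalty is the average of the lasso and ridge penalties, which is why intermediate alphas behave as a compromise between the two.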

#Try lasso, ridge and elastic net
library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-7

## 
## Attaching package: 'glmnet'

## The following object is masked from 'package:imputeTS':
## 
##     na.replace

#Run cv.glmnet over a range of alphas from 0 to 1 in increments of 0.05
models <- list()
for (i in 0:20) {
  name <- paste0("alpha", i/20)
  models[[name]] <-
    cv.glmnet(data.matrix(beverage_train), data.matrix(PH_train), type.measure="mse", alpha=i/20, 
              family="gaussian")
}

#Evaluate each model on the test set at lambda = lambda.1se
for (i in 0:20) {
  name <- paste0("alpha", i/20)
  
  ## Use each model to predict PH for the test set
  predicted <- predict(models[[name]], 
                       s=models[[name]]$lambda.1se, newx=data.matrix(beverage_test))
  
  ## Print RMSE, R-squared and MAE, in that order
  print(paste0(name, "   ", postResample(predicted, PH_test)))
}

## [1] "alpha0   0.144628947027996" "alpha0   0.352745875148305"
## [3] "alpha0   0.112719069471538"
## [1] "alpha0.05   0.143345344620646" "alpha0.05   0.356693089451303"
## [3] "alpha0.05   0.111128472260664"
## [1] "alpha0.1   0.142588323797725" "alpha0.1   0.35866926620948" 
## [3] "alpha0.1   0.109987071714831"
## [1] "alpha0.15   0.143584901960664" "alpha0.15   0.355668754155426"
## [3] "alpha0.15   0.111375037346323"
## [1] "alpha0.2   0.143675841397845" "alpha0.2   0.355372635620023"
## [3] "alpha0.2   0.111464476670412"
## [1] "alpha0.25   0.144273521705058" "alpha0.25   0.353005712525072"
## [3] "alpha0.25   0.112112924160866"
## [1] "alpha0.3   0.144022527687625" "alpha0.3   0.353810738291371"
## [3] "alpha0.3   0.11181355235111" 
## [1] "alpha0.35   0.143298589656592" "alpha0.35   0.356216457255417"
## [3] "alpha0.35   0.1109547240817"  
## [1] "alpha0.4   0.14282353052989"  "alpha0.4   0.357280198832823"
## [3] "alpha0.4   0.110330382589173"
## [1] "alpha0.45   0.143324579527607" "alpha0.45   0.355944406422526"
## [3] "alpha0.45   0.110965230010694"
## [1] "alpha0.5   0.144439292488163" "alpha0.5   0.351969046802832"
## [3] "alpha0.5   0.112301084493151"
## [1] "alpha0.55   0.143681459973547" "alpha0.55   0.354707613293933"
## [3] "alpha0.55   0.111382970051521"
## [1] "alpha0.6   0.143618101630229" "alpha0.6   0.354823227049457"
## [3] "alpha0.6   0.111298564450569"
## [1] "alpha0.65   0.143320949339869" "alpha0.65   0.355568715300937"
## [3] "alpha0.65   0.110924744784224"
## [1] "alpha0.7   0.143070585294359" "alpha0.7   0.356160990541308"
## [3] "alpha0.7   0.11061527559736" 
## [1] "alpha0.75   0.142505816304062" "alpha0.75   0.356950999300253"
## [3] "alpha0.75   0.109691525908328"
## [1] "alpha0.8   0.143017258369607" "alpha0.8   0.356143647537724"
## [3] "alpha0.8   0.110531146630373"
## [1] "alpha0.85   0.143670263157645" "alpha0.85   0.354393283774306"
## [3] "alpha0.85   0.111353933283967"
## [1] "alpha0.9   0.143389776220651" "alpha0.9   0.355092640060184"
## [3] "alpha0.9   0.110984760537508"
## [1] "alpha0.95   0.143611010137807" "alpha0.95   0.354482270527309"
## [3] "alpha0.95   0.111270787168686"
## [1] "alpha1   0.142657148799033" "alpha1   0.356601574987147"
## [3] "alpha1   0.109950530264747"

Elast <- cv.glmnet(data.matrix(beverage_train), data.matrix(PH_train), type.measure="mse", alpha=0.85, family="gaussian")

Elast_predict<-Elast|>predict(data.matrix(beverage_test))

elastic_per<-postResample(pred =Elast_predict, obs = PH_test)

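In the run shown above, the lowest test RMSE (0.1425) occurs at alpha = 0.75, but the differences across alpha are small (0.1425 to 0.1446). Alpha = 0.85 is used for the final elastic net model. Because cv.glmnet assigns cross-validation folds at random, these values change slightly from run to run.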
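Collecting the metrics in a data frame makes the comparison across alpha easier to read than the printed strings. The sketch below uses a handful of the test RMSE values printed above, rounded to six decimals; the object names are hypothetical.

```r
# Test-set RMSE for a few alphas, copied from the printed output (rounded).
rmse_by_alpha <- data.frame(
  alpha = c(0, 0.10, 0.75, 0.85, 1),
  RMSE  = c(0.144629, 0.142588, 0.142506, 0.143670, 0.142657)
)

# Row with the smallest RMSE.
best <- rmse_by_alpha[which.min(rmse_by_alpha$RMSE), ]
best

# Spread between the best and worst alpha in this subset.
rmse_spread <- diff(range(rmse_by_alpha$RMSE))
rmse_spread
```

A spread this small relative to the RMSE itself suggests the choice of alpha matters little for this data set.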

PLS

# PLS Model
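# 'Rsquare' is not a metric name caret recognizes; RMSE is the metric actually used for selection (see the output below)
pls <- train(x = data.matrix(beverage_train), y = PH_train, method = "pls", metric = "RMSE", tuneLength = 20, trControl = trainControl(method = "cv"), preProcess = c("center", "scale"))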

pls

## Partial Least Squares 
## 
## 2056 samples
##   22 predictor
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1850, 1850, 1850, 1851, 1850, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1525111  0.2062955  0.1214919
##    2     0.1451964  0.2795336  0.1145425
##    3     0.1437283  0.2937023  0.1125902
##    4     0.1427832  0.3028968  0.1121557
##    5     0.1415995  0.3140656  0.1109464
##    6     0.1409835  0.3196866  0.1100036
##    7     0.1407483  0.3221680  0.1096288
##    8     0.1405707  0.3242687  0.1094245
##    9     0.1405015  0.3249821  0.1093270
##   10     0.1404906  0.3252364  0.1092157
##   11     0.1404007  0.3260778  0.1091109
##   12     0.1402747  0.3274026  0.1088707
##   13     0.1402807  0.3273721  0.1088528
##   14     0.1402702  0.3273344  0.1088713
##   15     0.1402150  0.3279297  0.1088570
##   16     0.1401679  0.3286068  0.1089151
##   17     0.1402004  0.3282795  0.1089962
##   18     0.1401968  0.3283270  0.1089893
##   19     0.1401964  0.3283311  0.1089894
##   20     0.1401988  0.3283072  0.1089934
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.

plot(pls)

plspredict<-pls|>predict(data.matrix(beverage_test))

pls_per<-postResample(pred =plspredict, obs = PH_test)

KNN

set.seed(12345)
knn <- train(x = data.matrix(beverage_train), y = PH_train, method = "knn", preProc = c("center", "scale"), tuneLength = 10)

knn

## k-Nearest Neighbors 
## 
## 2056 samples
##   22 predictor
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2056, 2056, 2056, 2056, 2056, 2056, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.1427398  0.3353862  0.1061367
##    7  0.1389803  0.3527375  0.1044466
##    9  0.1375999  0.3588496  0.1041113
##   11  0.1367679  0.3630672  0.1038095
##   13  0.1363377  0.3649570  0.1037478
##   15  0.1362424  0.3649974  0.1038454
##   17  0.1361829  0.3649795  0.1041369
##   19  0.1363455  0.3632924  0.1044109
##   21  0.1366181  0.3607244  0.1047539
##   23  0.1367730  0.3590180  0.1050115
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

knnPred <- knn|>predict(data.matrix(beverage_test))

knn_per<-postResample(pred = knnPred, obs = PH_test)

Neural Net

The network has a single hidden layer of 3 units (size = 3) and a weight decay of 0.04.

set.seed(12345)
library(nnet)
nn <- nnet(data.matrix(beverage_train),  PH_train,
                size = 3,
                decay = 0.04,
                linout = TRUE,
                trace = FALSE,
                maxit = 5000,
                MaxNWts = 5 * (ncol(beverage_train) + 1) +5 +1)
nn

## a 22-3-1 network with 73 weights
## options were - linear output units  decay=0.04

nnPred <- nn|>predict(data.matrix(beverage_test))

nn_per<-postResample(pred = nnPred, obs = PH_test)
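
The "73 weights" in the printout follows from the 22-3-1 architecture: each of the 3 hidden units has 22 input weights plus a bias, and the output unit has 3 weights plus a bias. A quick arithmetic check, with illustrative variable names; the MaxNWts expression in the call above is written for 5 hidden units, so it acts only as a looser upper bound.

```r
n_inputs <- 22  # predictors in beverage_train, per the model printout
n_hidden <- 3   # size = 3

# Single hidden layer with one linear output:
# hidden layer (inputs + bias per unit) + output layer (hidden units + bias)
n_weights <- n_hidden * (n_inputs + 1) + (n_hidden + 1)
n_weights

# The MaxNWts expression used in the nnet() call assumes 5 hidden units
max_weights <- 5 * (n_inputs + 1) + 5 + 1
max_weights
```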

MARS

set.seed(12345)
library(earth)

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

## Loading required package: TeachingDemos

mars <- earth(data.matrix(beverage_train), PH_train)

mars

## Selected 18 of 22 terms, and 10 of 22 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: Mnf.Flow, Brand.Code, Usage.cont, Pressure.Vacuum, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.01663529    RSS 33.04759    GRSq 0.4283813    RSq 0.4471397

marsPred <- mars|>predict(data.matrix(beverage_test))

mars_per<-postResample(pred = marsPred, obs = PH_test)

SVM (linear and radial)

set.seed(12345)
library(caret)

svm_Radial <- train(data.matrix(beverage_train), PH_train, method = "svmRadial", preProc = c("center","scale"), trControl = trainControl(method = "cv"))

svm_Radial

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2056 samples
##   22 predictor
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1851, 1850, 1851, 1850, 1850, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared   MAE       
##   0.25  0.1297488  0.4312326  0.09728400
##   0.50  0.1269181  0.4533325  0.09430035
##   1.00  0.1254221  0.4643878  0.09267249
## 
## Tuning parameter 'sigma' was held constant at a value of 0.03242085
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.03242085 and C = 1.

svm_RadialPred <- svm_Radial|>predict(data.matrix(beverage_test))

svm_Radial_per<-postResample(pred = svm_RadialPred, obs = PH_test) 

#SVM Linear
svm_Linear <- train(data.matrix(beverage_train), PH_train, method = "svmLinear", preProc = c("center","scale"), trControl = trainControl(method = "cv"))

svm_Linear

## Support Vector Machines with Linear Kernel 
## 
## 2056 samples
##   22 predictor
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1850, 1851, 1851, 1851, 1851, 1851, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1418761  0.3166113  0.1079261
## 
## Tuning parameter 'C' was held constant at a value of 1

svm_LinearPred <- svm_Linear|>predict(data.matrix(beverage_test))

svm_Linear_per<-postResample(pred = svm_LinearPred, obs = PH_test)

Random Forest

set.seed(12345)
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

PH_train2<-cbind(beverage_train, PH_train)

rf <- randomForest(PH_train ~., data = PH_train2, importance = TRUE, ntree = 1000)

rf

## 
## Call:
##  randomForest(formula = PH_train ~ ., data = PH_train2, importance = TRUE,      ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 0.0109958
##                     % Var explained: 62.18

rfPred <- rf|>predict(beverage_test)

rf_per<-postResample(pred = rfPred, obs = PH_test)

GBM

The model uses interaction.depth = 4, n.trees = 10000, and shrinkage = 0.007.

set.seed(12345)
library(gbm)

## Loaded gbm 2.1.8.1

boosted <- gbm(
  formula = PH_train ~., data = PH_train2[,-1],
  distribution = "gaussian",
  n.trees = 10000,
  interaction.depth = 4,
  shrinkage = 0.007,
  cv.folds = 10,
  n.cores = NULL, 
  verbose = FALSE
  ) 

boosted

## gbm(formula = PH_train ~ ., distribution = "gaussian", data = PH_train2[, 
##     -1], n.trees = 10000, interaction.depth = 4, shrinkage = 0.007, 
##     cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 10000.
## There were 21 predictors of which 21 had non-zero influence.

print(paste("Number of trees:", boosted$n.trees))

## [1] "Number of trees: 10000"

boostPred <- predict(boosted, n.trees = boosted$n.trees, newdata = as.data.frame(beverage_test))

boost_per<-postResample(pred = boostPred, obs = PH_test)
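
The printout reports the best cross-validation iteration as 10000, which equals n.trees; this suggests the CV error may still have been falling when boosting stopped. One way to check is gbm.perf(), which returns the iteration with the lowest CV error. The sketch below runs on small synthetic data; the data frame toy and every parameter value in it are illustrative assumptions, not the report's settings.

```r
library(gbm)

set.seed(12345)
# Synthetic regression data, for illustration only
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 2 * toy$x1 + rnorm(300, sd = 0.1)

fit <- gbm(
  y ~ ., data = toy,
  distribution = "gaussian",
  n.trees = 300,
  interaction.depth = 2,
  shrinkage = 0.05,
  cv.folds = 5,
  n.cores = 1,
  verbose = FALSE
)

# Iteration with the lowest cross-validated error
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# Predict with that many trees rather than with all of them
toy_pred <- predict(fit, newdata = toy, n.trees = best_iter)
```

If best_iter comes back equal to n.trees, as it does for the model above, a larger n.trees or a larger shrinkage may be worth trying.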

Cubist

set.seed(12345)
library(Cubist)

cubist <- cubist(beverage_train, PH_train)

cubist

## 
## Call:
## cubist.default(x = beverage_train, y = PH_train)
## 
## Number of samples: 2056 
## Number of predictors: 22 
## 
## Number of committees: 1 
## Number of rules: 24

cubistPred <- predict(cubist,  newdata = beverage_test)

cubist_per<-postResample(pred = cubistPred, obs = PH_test)

Choosing the best model

#Create a summary for all models:
rbind(OLS_per, elastic_per, pls_per, knn_per, nn_per, mars_per, svm_Radial_per, svm_Linear_per, rf_per, boost_per, cubist_per)

##                      RMSE  Rsquared        MAE
## OLS_per        0.13484747 0.4184747 0.10496797
## elastic_per    0.14367026 0.3543933 0.11135393
## pls_per        0.14209470 0.3544839 0.10793878
## knn_per        0.12831982 0.4897171 0.09804966
## nn_per         0.14031098 0.3694024 0.10466801
## mars_per       0.13036106 0.4563444 0.09816436
## svm_Radial_per 0.12579316 0.5008554 0.09117058
## svm_Linear_per 0.14395534 0.3461752 0.10781250
## rf_per         0.09994053 0.7113584 0.07322895
## boost_per      0.10775088 0.6304643 0.08003457
## cubist_per     0.12128567 0.5402787 0.08648989

Random Forest and GBM have the lowest test RMSE and the highest R-squared of the 11 models. Random Forest is best on all three metrics (RMSE 0.100, R-squared 0.711, MAE 0.073).
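
To make the ranking explicit, the comparison table can be sorted by test RMSE. The sketch below re-enters the values printed above, rounded to four decimals, instead of relying on the fitted objects, so it runs on its own; the names results and ranked are illustrative.

```r
results <- data.frame(
  model = c("OLS", "Elastic net", "PLS", "KNN", "Neural net", "MARS",
            "SVM radial", "SVM linear", "Random forest", "GBM", "Cubist"),
  RMSE = c(0.1348, 0.1437, 0.1421, 0.1283, 0.1403, 0.1304,
           0.1258, 0.1440, 0.0999, 0.1078, 0.1213),
  Rsquared = c(0.4185, 0.3544, 0.3545, 0.4897, 0.3694, 0.4563,
               0.5009, 0.3462, 0.7114, 0.6305, 0.5403)
)

# Lowest test RMSE first
ranked <- results[order(results$RMSE), ]
ranked
```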

Assessing variable importance

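# Variable importance for the stepwise linear model and the random forest
lmod_Imp<-varImp(lmod_step)
rf_Imp<-varImp(rf)

# Relative influence of each predictor in the boosted model
summary(boosted)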
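
The importance objects above are computed but not displayed. A minimal sketch of how a random forest's permutation importance (%IncMSE) could be ranked, using synthetic data in which only x1 drives the response; toy and toy_rf are hypothetical stand-ins for PH_train2 and rf.

```r
library(randomForest)

set.seed(12345)
# Synthetic data: only x1 influences y (illustration only)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- 3 * toy$x1 + rnorm(200, sd = 0.5)

toy_rf <- randomForest(y ~ ., data = toy, importance = TRUE, ntree = 200)

# type = 1 selects the permutation measure (%IncMSE)
imp <- importance(toy_rf, type = 1)

# Sort predictors from most to least important
imp_ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
imp_ranked
```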

Conclusion

We tested 11 machine learning models on the beverage data to predict PH. Our final model is tree-based (Random Forest). Its test-set R-squared is about 71%, so a sizeable share of the variation in PH remains unexplained and the model is not yet a strong one. More work on the predictors is needed to improve it.

DATA624 Project 2

Chun Yip

2023/05/10
