This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
A summary of the data set-student data can be found below:
#Load library
library(fpp3)
library(ggplot2)
library(feasts)
library(readxl)
library(openxlsx)
library(imputeTS)
library(caret)
library(corrplot)
library(e1071)
#Read data
beverage<-read_xlsx('StudentData.xlsx')#|>
#Find out all the NAs
beverage|>
skimr::skim()|>
filter(n_missing>0)|>
arrange(desc(n_missing))#|>| Name | beverage |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
#select(-PH)There are 32 predictors and 1 response. All predictors are numeric variables except the variable Brand Code which is character
Additional findings on data: 1. There were many NAs in each column. 2. Only a small number of predictors are normally distributed.
Using the preProcess with KNN for imputation
preProcValues<-preProcess(as.data.frame(beverage), method="knnImpute", k=5, knnSummary=mean)
impute_beverage<-predict(preProcValues, beverage, na.action=na.pass)
procNames <- data.frame(col = names(preProcValues$mean), mean = preProcValues$mean, sd = preProcValues$std)
for(i in procNames$col){
impute_beverage[i] <- impute_beverage[i]*preProcValues$std[i]+preProcValues$mean[i]
}
library(imputeMissings)
# save the result as another object
impute_beverage_c <- impute(impute_beverage, method = "median/mode")
# check if there is any NAs
skimr::skim(impute_beverage_c)| Name | impute_beverage_c |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 0 | 1 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 0 | 1 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 0 | 1 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Pressure | 0 | 1 | 68.21 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb.Temp | 0 | 1 | 141.12 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 0 | 1 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 0 | 1 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC.CO2 | 0 | 1 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf.Flow | 0 | 1 | 24.47 | 119.49 | -100.20 | -100.00 | 64.80 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 0 | 1 | 122.54 | 4.74 | 105.60 | 118.80 | 123.20 | 125.40 | 140.20 | ▁▅▇▂▁ |
| Fill.Pressure | 0 | 1 | 47.93 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 0 | 1 | 12.46 | 12.42 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure2 | 0 | 1 | 20.96 | 16.37 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd.Pressure3 | 0 | 1 | 20.43 | 15.95 | -1.20 | 0.00 | 27.40 | 33.20 | 50.00 | ▇▁▃▇▁ |
| Hyd.Pressure4 | 0 | 1 | 96.37 | 13.09 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler.Level | 0 | 1 | 109.24 | 15.68 | 55.80 | 98.40 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler.Speed | 0 | 1 | 3681.24 | 767.26 | 998.00 | 3866.80 | 3980.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 0 | 1 | 20.99 | 2.98 | 12.08 | 18.36 | 21.80 | 23.76 | 25.90 | ▁▃▅▃▇ |
| Carb.Flow | 0 | 1 | 2467.97 | 1073.38 | 26.00 | 1151.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 0 | 1 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 0 | 1 | 690.08 | 92.36 | 31.40 | 697.60 | 722.20 | 730.40 | 868.60 | ▁▁▁▂▇ |
| Balling | 0 | 1 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 0 | 1 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 0 | 1 | 109.30 | 15.32 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 0 | 1 | 47.61 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air.Pressurer | 0 | 1 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch.Rel | 0 | 1 | 6.90 | 0.50 | 5.28 | 6.54 | 6.56 | 7.23 | 8.62 | ▁▇▂▃▁ |
| Carb.Rel | 0 | 1 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▆▂▁ |
| Balling.Lvl | 0 | 1 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
The impute_beverage is normalized and is then de-normalized to get the original data back. However, the preProcess function of Caret packages only works for numerical values. The function impute from “imputeMissings package” is used to fill up the remaining missing values. There were 120 missing values in Brand Code, which is less than 5% of the data, it should be fine to impute in this way.
The data is further reviewed to look for correlation among variables.
td<-impute_beverage_c[,-1]
cor_res <- cor(td, use = "na.or.complete")
corrplot(cor_res,
type = "lower",
order = "original",
tl.col = "Blue",
tl.srt = 45,
tl.cex = 0.5
)#Filter out high correlation variables
highCorr <- findCorrelation(cor_res, cutoff = .75)
length(highCorr)## [1] 10
filtered_impute_beverage_c <- impute_beverage_c[, -highCorr]
skimr::skim(filtered_impute_beverage_c)| Name | filtered_impute_beverage_… |
| Number of rows | 2571 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand.Code | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb.Volume | 0 | 1 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill.Ounces | 0 | 1 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC.Volume | 0 | 1 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb.Temp | 0 | 1 | 141.12 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 0 | 1 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC.Fill | 0 | 1 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| Mnf.Flow | 0 | 1 | 24.47 | 119.49 | -100.20 | -100.00 | 64.80 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb.Pressure1 | 0 | 1 | 122.54 | 4.74 | 105.60 | 118.80 | 123.20 | 125.40 | 140.20 | ▁▅▇▂▁ |
| Fill.Pressure | 0 | 1 | 47.93 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd.Pressure1 | 0 | 1 | 12.46 | 12.42 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd.Pressure3 | 0 | 1 | 20.43 | 15.95 | -1.20 | 0.00 | 27.40 | 33.20 | 50.00 | ▇▁▃▇▁ |
| Filler.Speed | 0 | 1 | 3681.24 | 767.26 | 998.00 | 3866.80 | 3980.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage.cont | 0 | 1 | 20.99 | 2.98 | 12.08 | 18.36 | 21.80 | 23.76 | 25.90 | ▁▃▅▃▇ |
| Density | 0 | 1 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| Balling | 0 | 1 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure.Vacuum | 0 | 1 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen.Filler | 0 | 1 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl.Setpoint | 0 | 1 | 109.30 | 15.32 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure.Setpoint | 0 | 1 | 47.61 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Balling.Lvl | 0 | 1 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
10 high correlated variables are removed.
Check normality of all data
filtered_impute_beverage_c|>
gather(key='predictor',value = 'value')|>
ggplot(aes(x=value))+
geom_bar()+
facet_wrap(~predictor, scales='free')
The normality of PH is ok.
Take a first look on the distribution of response PH before splitting data. Then splitting the data into a training and a test set by a ratio of 80-20
set.seed(2744)
#Remove a outliner from the result in OLS
rout<-c(1094,1719,2359)
filtered_impute_beverage_c<-filtered_impute_beverage_c[-rout,]
fold <- filtered_impute_beverage_c$PH %>%
createDataPartition(p = 0.8, list = FALSE, times = 1)
#Create training and testing set
beverage_train<-filtered_impute_beverage_c[fold,-19]
beverage_test<-filtered_impute_beverage_c[-fold,-19]
PH_train<-filtered_impute_beverage_c[fold,19]
PH_test<-filtered_impute_beverage_c[-fold,19]Start with OLS
#Try the traditional linear regression
yb<-as.data.frame(cbind(PH_train, beverage_train))
lmod<-lm(PH_train~., yb)
summary(lmod)##
## Call:
## lm(formula = PH_train ~ ., data = yb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52590 -0.07900 0.01000 0.08871 0.39979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.183e+01 8.821e-01 13.413 < 2e-16 ***
## Brand.CodeB 8.741e-02 2.372e-02 3.685 0.000234 ***
## Brand.CodeC -5.461e-02 2.386e-02 -2.289 0.022210 *
## Brand.CodeD 8.914e-02 1.214e-02 7.344 2.98e-13 ***
## Carb.Volume -4.560e-02 5.024e-02 -0.908 0.364177
## Fill.Ounces -1.122e-01 3.517e-02 -3.189 0.001450 **
## PC.Volume -3.786e-02 5.793e-02 -0.654 0.513490
## Carb.Temp 7.290e-04 7.390e-04 0.986 0.324028
## PSC -1.272e-01 6.369e-02 -1.996 0.046024 *
## PSC.Fill -4.044e-02 2.574e-02 -1.571 0.116359
## Mnf.Flow -7.718e-04 4.990e-05 -15.467 < 2e-16 ***
## Carb.Pressure1 5.836e-03 7.423e-04 7.862 6.09e-15 ***
## Fill.Pressure 2.309e-03 1.292e-03 1.787 0.074070 .
## Hyd.Pressure1 -1.265e-04 3.592e-04 -0.352 0.724824
## Hyd.Pressure3 2.796e-03 4.352e-04 6.425 1.64e-10 ***
## Filler.Speed 5.210e-06 4.939e-06 1.055 0.291678
## Temperature -1.675e-02 2.500e-03 -6.699 2.70e-11 ***
## Usage.cont -8.332e-03 1.206e-03 -6.907 6.60e-12 ***
## Density -1.042e-01 2.946e-02 -3.535 0.000416 ***
## Balling -1.093e-01 2.487e-02 -4.393 1.17e-05 ***
## Pressure.Vacuum -1.659e-02 8.026e-03 -2.067 0.038845 *
## Oxygen.Filler -4.096e-01 8.052e-02 -5.087 3.97e-07 ***
## Bowl.Setpoint 1.817e-03 2.759e-04 6.586 5.75e-11 ***
## Pressure.Setpoint -7.249e-03 2.090e-03 -3.469 0.000533 ***
## Balling.Lvl 1.725e-01 2.487e-02 6.936 5.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1318 on 2031 degrees of freedom
## Multiple R-squared: 0.4101, Adjusted R-squared: 0.4031
## F-statistic: 58.83 on 24 and 2031 DF, p-value: < 2.2e-16
#As there are way too many predictors, run the step function to minimize the predictor
step(lmod)## Start: AIC=-8309.07
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + PC.Volume +
## Carb.Temp + PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 +
## Fill.Pressure + Hyd.Pressure1 + Hyd.Pressure3 + Filler.Speed +
## Temperature + Usage.cont + Density + Balling + Pressure.Vacuum +
## Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
##
## Df Sum of Sq RSS AIC
## - Hyd.Pressure1 1 0.0022 35.265 -8310.9
## - PC.Volume 1 0.0074 35.270 -8310.6
## - Carb.Volume 1 0.0143 35.277 -8310.2
## - Carb.Temp 1 0.0169 35.279 -8310.1
## - Filler.Speed 1 0.0193 35.282 -8309.9
## <none> 35.263 -8309.1
## - PSC.Fill 1 0.0428 35.305 -8308.6
## - Fill.Pressure 1 0.0555 35.318 -8307.8
## - PSC 1 0.0692 35.332 -8307.0
## - Pressure.Vacuum 1 0.0742 35.337 -8306.7
## - Fill.Ounces 1 0.1766 35.439 -8300.8
## - Pressure.Setpoint 1 0.2089 35.472 -8298.9
## - Density 1 0.2170 35.480 -8298.5
## - Balling 1 0.3351 35.598 -8291.6
## - Oxygen.Filler 1 0.4493 35.712 -8285.0
## - Hyd.Pressure3 1 0.7168 35.979 -8269.7
## - Bowl.Setpoint 1 0.7530 36.016 -8267.6
## - Temperature 1 0.7792 36.042 -8266.1
## - Usage.cont 1 0.8283 36.091 -8263.3
## - Balling.Lvl 1 0.8352 36.098 -8262.9
## - Carb.Pressure1 1 1.0731 36.336 -8249.4
## - Mnf.Flow 1 4.1533 39.416 -8082.1
## - Brand.Code 3 4.4929 39.755 -8068.5
##
## Step: AIC=-8310.94
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + PC.Volume +
## Carb.Temp + PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 +
## Fill.Pressure + Hyd.Pressure3 + Filler.Speed + Temperature +
## Usage.cont + Density + Balling + Pressure.Vacuum + Oxygen.Filler +
## Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
##
## Df Sum of Sq RSS AIC
## - PC.Volume 1 0.0095 35.274 -8312.4
## - Carb.Volume 1 0.0138 35.279 -8312.1
## - Carb.Temp 1 0.0177 35.282 -8311.9
## - Filler.Speed 1 0.0214 35.286 -8311.7
## <none> 35.265 -8310.9
## - PSC.Fill 1 0.0429 35.308 -8310.4
## - Fill.Pressure 1 0.0591 35.324 -8309.5
## - PSC 1 0.0683 35.333 -8309.0
## - Pressure.Vacuum 1 0.0775 35.342 -8308.4
## - Fill.Ounces 1 0.1764 35.441 -8302.7
## - Pressure.Setpoint 1 0.2106 35.475 -8300.7
## - Density 1 0.2254 35.490 -8299.8
## - Balling 1 0.3332 35.598 -8293.6
## - Oxygen.Filler 1 0.4479 35.713 -8287.0
## - Temperature 1 0.7771 36.042 -8268.1
## - Bowl.Setpoint 1 0.7988 36.064 -8266.9
## - Usage.cont 1 0.8302 36.095 -8265.1
## - Balling.Lvl 1 0.8330 36.098 -8264.9
## - Hyd.Pressure3 1 1.0698 36.335 -8251.5
## - Carb.Pressure1 1 1.1133 36.378 -8249.0
## - Mnf.Flow 1 4.1519 39.417 -8084.1
## - Brand.Code 3 4.4947 39.759 -8070.3
##
## Step: AIC=-8312.39
## PH_train ~ Brand.Code + Carb.Volume + Fill.Ounces + Carb.Temp +
## PSC + PSC.Fill + Mnf.Flow + Carb.Pressure1 + Fill.Pressure +
## Hyd.Pressure3 + Filler.Speed + Temperature + Usage.cont +
## Density + Balling + Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint +
## Pressure.Setpoint + Balling.Lvl
##
## Df Sum of Sq RSS AIC
## - Carb.Volume 1 0.0126 35.287 -8313.7
## - Carb.Temp 1 0.0177 35.292 -8313.4
## - Filler.Speed 1 0.0216 35.296 -8313.1
## <none> 35.274 -8312.4
## - PSC.Fill 1 0.0397 35.314 -8312.1
## - Fill.Pressure 1 0.0635 35.338 -8310.7
## - Pressure.Vacuum 1 0.0716 35.346 -8310.2
## - PSC 1 0.0881 35.362 -8309.3
## - Fill.Ounces 1 0.1693 35.444 -8304.5
## - Pressure.Setpoint 1 0.2223 35.497 -8301.5
## - Density 1 0.2384 35.513 -8300.5
## - Balling 1 0.3290 35.603 -8295.3
## - Oxygen.Filler 1 0.4595 35.734 -8287.8
## - Temperature 1 0.7833 36.058 -8269.2
## - Bowl.Setpoint 1 0.7970 36.071 -8268.5
## - Usage.cont 1 0.8280 36.102 -8266.7
## - Balling.Lvl 1 0.8427 36.117 -8265.8
## - Hyd.Pressure3 1 1.0608 36.335 -8253.5
## - Carb.Pressure1 1 1.1732 36.447 -8247.1
## - Mnf.Flow 1 4.1455 39.420 -8085.9
## - Brand.Code 3 4.5148 39.789 -8070.8
##
## Step: AIC=-8313.65
## PH_train ~ Brand.Code + Fill.Ounces + Carb.Temp + PSC + PSC.Fill +
## Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 +
## Filler.Speed + Temperature + Usage.cont + Density + Balling +
## Pressure.Vacuum + Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint +
## Balling.Lvl
##
## Df Sum of Sq RSS AIC
## - Filler.Speed 1 0.0202 35.307 -8314.5
## - Carb.Temp 1 0.0270 35.314 -8314.1
## <none> 35.287 -8313.7
## - PSC.Fill 1 0.0390 35.326 -8313.4
## - Fill.Pressure 1 0.0627 35.350 -8312.0
## - Pressure.Vacuum 1 0.0691 35.356 -8311.6
## - PSC 1 0.0884 35.375 -8310.5
## - Fill.Ounces 1 0.1899 35.477 -8304.6
## - Pressure.Setpoint 1 0.2282 35.515 -8302.4
## - Density 1 0.2478 35.535 -8301.3
## - Balling 1 0.3264 35.613 -8296.7
## - Oxygen.Filler 1 0.4568 35.744 -8289.2
## - Temperature 1 0.7829 36.070 -8270.5
## - Bowl.Setpoint 1 0.7978 36.085 -8269.7
## - Balling.Lvl 1 0.8366 36.123 -8267.5
## - Usage.cont 1 0.8403 36.127 -8267.3
## - Hyd.Pressure3 1 1.0877 36.375 -8253.2
## - Carb.Pressure1 1 1.1673 36.454 -8248.7
## - Mnf.Flow 1 4.1793 39.466 -8085.5
## - Brand.Code 3 4.6334 39.920 -8066.0
##
## Step: AIC=-8314.48
## PH_train ~ Brand.Code + Fill.Ounces + Carb.Temp + PSC + PSC.Fill +
## Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 +
## Temperature + Usage.cont + Density + Balling + Pressure.Vacuum +
## Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
##
## Df Sum of Sq RSS AIC
## - Carb.Temp 1 0.0301 35.337 -8314.7
## <none> 35.307 -8314.5
## - PSC.Fill 1 0.0391 35.346 -8314.2
## - Pressure.Vacuum 1 0.0535 35.361 -8313.4
## - Fill.Pressure 1 0.0583 35.365 -8313.1
## - PSC 1 0.0875 35.395 -8311.4
## - Fill.Ounces 1 0.1872 35.494 -8305.6
## - Density 1 0.2370 35.544 -8302.7
## - Pressure.Setpoint 1 0.2398 35.547 -8302.6
## - Balling 1 0.3063 35.613 -8298.7
## - Oxygen.Filler 1 0.4601 35.767 -8289.9
## - Bowl.Setpoint 1 0.8028 36.110 -8270.3
## - Balling.Lvl 1 0.8246 36.132 -8269.0
## - Usage.cont 1 0.8365 36.144 -8268.3
## - Temperature 1 0.8835 36.191 -8265.7
## - Carb.Pressure1 1 1.1626 36.470 -8249.9
## - Hyd.Pressure3 1 1.2558 36.563 -8244.6
## - Mnf.Flow 1 4.1714 39.478 -8086.9
## - Brand.Code 3 4.6178 39.925 -8067.8
##
## Step: AIC=-8314.73
## PH_train ~ Brand.Code + Fill.Ounces + PSC + PSC.Fill + Mnf.Flow +
## Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature +
## Usage.cont + Density + Balling + Pressure.Vacuum + Oxygen.Filler +
## Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl
##
## Df Sum of Sq RSS AIC
## <none> 35.337 -8314.7
## - PSC.Fill 1 0.0400 35.377 -8314.4
## - Pressure.Vacuum 1 0.0520 35.389 -8313.7
## - Fill.Pressure 1 0.0566 35.394 -8313.4
## - PSC 1 0.0911 35.428 -8311.4
## - Fill.Ounces 1 0.1869 35.524 -8305.9
## - Density 1 0.2309 35.568 -8303.3
## - Pressure.Setpoint 1 0.2441 35.581 -8302.6
## - Balling 1 0.3084 35.646 -8298.9
## - Oxygen.Filler 1 0.4644 35.802 -8289.9
## - Bowl.Setpoint 1 0.7958 36.133 -8270.9
## - Balling.Lvl 1 0.8243 36.161 -8269.3
## - Usage.cont 1 0.8357 36.173 -8268.7
## - Temperature 1 0.8720 36.209 -8266.6
## - Carb.Pressure1 1 1.1577 36.495 -8250.4
## - Hyd.Pressure3 1 1.2744 36.612 -8243.9
## - Mnf.Flow 1 4.1877 39.525 -8086.5
## - Brand.Code 3 4.6267 39.964 -8067.8
##
## Call:
## lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + PSC.Fill +
## Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 +
## Temperature + Usage.cont + Density + Balling + Pressure.Vacuum +
## Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl,
## data = yb)
##
## Coefficients:
## (Intercept) Brand.CodeB Brand.CodeC Brand.CodeD
## 11.8027377 0.0947439 -0.0465888 0.0840951
## Fill.Ounces PSC PSC.Fill Mnf.Flow
## -0.1130416 -0.1407206 -0.0389223 -0.0007735
## Carb.Pressure1 Fill.Pressure Hyd.Pressure3 Temperature
## 0.0057393 0.0023010 0.0028126 -0.0172319
## Usage.cont Density Balling Pressure.Vacuum
## -0.0081892 -0.1052398 -0.1005178 -0.0130724
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Balling.Lvl
## -0.4150076 0.0017462 -0.0077552 0.1654405
lmod_step<-lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + Mnf.Flow + Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature + Usage.cont + Density + Balling + Oxygen.Filler + Bowl.Setpoint + Pressure.Setpoint + Balling.Lvl, data = yb)
summary(lmod_step)##
## Call:
## lm(formula = PH_train ~ Brand.Code + Fill.Ounces + PSC + Mnf.Flow +
## Carb.Pressure1 + Fill.Pressure + Hyd.Pressure3 + Temperature +
## Usage.cont + Density + Balling + Oxygen.Filler + Bowl.Setpoint +
## Pressure.Setpoint + Balling.Lvl, data = yb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52140 -0.07912 0.01137 0.08991 0.39263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.189e+01 8.516e-01 13.963 < 2e-16 ***
## Brand.CodeB 9.529e-02 2.324e-02 4.101 4.28e-05 ***
## Brand.CodeC -4.542e-02 2.319e-02 -1.959 0.050284 .
## Brand.CodeD 8.306e-02 1.124e-02 7.389 2.15e-13 ***
## Fill.Ounces -1.173e-01 3.442e-02 -3.409 0.000665 ***
## PSC -1.622e-01 6.055e-02 -2.678 0.007459 **
## Mnf.Flow -7.685e-04 4.975e-05 -15.446 < 2e-16 ***
## Carb.Pressure1 5.890e-03 6.974e-04 8.446 < 2e-16 ***
## Fill.Pressure 2.272e-03 1.275e-03 1.782 0.074957 .
## Hyd.Pressure3 2.971e-03 3.186e-04 9.325 < 2e-16 ***
## Temperature -1.647e-02 2.390e-03 -6.891 7.36e-12 ***
## Usage.cont -8.143e-03 1.181e-03 -6.895 7.15e-12 ***
## Density -1.092e-01 2.879e-02 -3.794 0.000152 ***
## Balling -8.123e-02 2.104e-02 -3.860 0.000117 ***
## Oxygen.Filler -3.952e-01 7.981e-02 -4.952 7.95e-07 ***
## Bowl.Setpoint 1.718e-03 2.578e-04 6.665 3.40e-11 ***
## Pressure.Setpoint -7.656e-03 2.068e-03 -3.702 0.000220 ***
## Balling.Lvl 1.479e-01 2.180e-02 6.786 1.51e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1318 on 2038 degrees of freedom
## Multiple R-squared: 0.4073, Adjusted R-squared: 0.4024
## F-statistic: 82.38 on 17 and 2038 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod_step)yb_step<-yb|>
select(Brand.Code, Fill.Ounces, PSC, Mnf.Flow, Carb.Pressure1, Fill.Pressure, Hyd.Pressure3, Temperature, Usage.cont, Density, Balling, Oxygen.Filler, Bowl.Setpoint, Pressure.Setpoint, Balling.Lvl)
OLS_predict<-lmod_step|>predict(beverage_test)
OLS_per<-postResample(pred = OLS_predict, obs = PH_test) The R-squared is too low for a good model.
#Try Lasso, Ridge and PLS
library(glmnet)## 载入需要的程辑包:Matrix
##
## 载入程辑包:'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-7
##
## 载入程辑包:'glmnet'
## The following object is masked from 'package:imputeTS':
##
## na.replace
#Run cv.glment function with a wide range of alphas with an increments of 0.05
models <- list()
for (i in 0:20) {
name <- paste0("alpha", i/20)
models[[name]] <-
cv.glmnet(data.matrix(beverage_train), data.matrix(PH_train), type.measure="mse", alpha=i/20,
family="gaussian")
}
#Compute the mean square error for each model where lambda equals to lamd.1se.
results <- data.frame()
for (i in 0:20) {
name <- paste0("alpha", i/20)
## Use each model to predict 'y' given the Testing dataset
predicted <- predict(models[[name]],
s=models[[name]]$lambda.1se, newx=data.matrix(beverage_test))
## Calculate the Root Mean Squared Error...
print(paste0(name, " ", postResample(predicted, PH_test)))
}## [1] "alpha0 0.144628947027996" "alpha0 0.352745875148305"
## [3] "alpha0 0.112719069471538"
## [1] "alpha0.05 0.143345344620646" "alpha0.05 0.356693089451303"
## [3] "alpha0.05 0.111128472260664"
## [1] "alpha0.1 0.142588323797725" "alpha0.1 0.35866926620948"
## [3] "alpha0.1 0.109987071714831"
## [1] "alpha0.15 0.143584901960664" "alpha0.15 0.355668754155426"
## [3] "alpha0.15 0.111375037346323"
## [1] "alpha0.2 0.143675841397845" "alpha0.2 0.355372635620023"
## [3] "alpha0.2 0.111464476670412"
## [1] "alpha0.25 0.144273521705058" "alpha0.25 0.353005712525072"
## [3] "alpha0.25 0.112112924160866"
## [1] "alpha0.3 0.144022527687625" "alpha0.3 0.353810738291371"
## [3] "alpha0.3 0.11181355235111"
## [1] "alpha0.35 0.143298589656592" "alpha0.35 0.356216457255417"
## [3] "alpha0.35 0.1109547240817"
## [1] "alpha0.4 0.14282353052989" "alpha0.4 0.357280198832823"
## [3] "alpha0.4 0.110330382589173"
## [1] "alpha0.45 0.143324579527607" "alpha0.45 0.355944406422526"
## [3] "alpha0.45 0.110965230010694"
## [1] "alpha0.5 0.144439292488163" "alpha0.5 0.351969046802832"
## [3] "alpha0.5 0.112301084493151"
## [1] "alpha0.55 0.143681459973547" "alpha0.55 0.354707613293933"
## [3] "alpha0.55 0.111382970051521"
## [1] "alpha0.6 0.143618101630229" "alpha0.6 0.354823227049457"
## [3] "alpha0.6 0.111298564450569"
## [1] "alpha0.65 0.143320949339869" "alpha0.65 0.355568715300937"
## [3] "alpha0.65 0.110924744784224"
## [1] "alpha0.7 0.143070585294359" "alpha0.7 0.356160990541308"
## [3] "alpha0.7 0.11061527559736"
## [1] "alpha0.75 0.142505816304062" "alpha0.75 0.356950999300253"
## [3] "alpha0.75 0.109691525908328"
## [1] "alpha0.8 0.143017258369607" "alpha0.8 0.356143647537724"
## [3] "alpha0.8 0.110531146630373"
## [1] "alpha0.85 0.143670263157645" "alpha0.85 0.354393283774306"
## [3] "alpha0.85 0.111353933283967"
## [1] "alpha0.9 0.143389776220651" "alpha0.9 0.355092640060184"
## [3] "alpha0.9 0.110984760537508"
## [1] "alpha0.95 0.143611010137807" "alpha0.95 0.354482270527309"
## [3] "alpha0.95 0.111270787168686"
## [1] "alpha1 0.142657148799033" "alpha1 0.356601574987147"
## [3] "alpha1 0.109950530264747"
Elast <- cv.glmnet(data.matrix(beverage_train), data.matrix(PH_train), type.measure="mse", alpha=0.85, family="gaussian")
Elast_predict<-Elast|>predict(data.matrix(beverage_test))
elastic_per<-postResample(pred =Elast_predict, obs = PH_test)**The best is at RMSE is at alpha 0.85
# PLS Model
pls <- train(x = data.matrix(beverage_train), y = PH_train, method = "pls", metric='Rsquare', tuneLength = 20, trControl = trainControl(method = "cv"), preProcess = c("center", "scale"))
pls## Partial Least Squares
##
## 2056 samples
## 22 predictor
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1852, 1850, 1850, 1850, 1851, 1850, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1525111 0.2062955 0.1214919
## 2 0.1451964 0.2795336 0.1145425
## 3 0.1437283 0.2937023 0.1125902
## 4 0.1427832 0.3028968 0.1121557
## 5 0.1415995 0.3140656 0.1109464
## 6 0.1409835 0.3196866 0.1100036
## 7 0.1407483 0.3221680 0.1096288
## 8 0.1405707 0.3242687 0.1094245
## 9 0.1405015 0.3249821 0.1093270
## 10 0.1404906 0.3252364 0.1092157
## 11 0.1404007 0.3260778 0.1091109
## 12 0.1402747 0.3274026 0.1088707
## 13 0.1402807 0.3273721 0.1088528
## 14 0.1402702 0.3273344 0.1088713
## 15 0.1402150 0.3279297 0.1088570
## 16 0.1401679 0.3286068 0.1089151
## 17 0.1402004 0.3282795 0.1089962
## 18 0.1401968 0.3283270 0.1089893
## 19 0.1401964 0.3283311 0.1089894
## 20 0.1401988 0.3283072 0.1089934
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 16.
plot(pls)plspredict<-pls|>predict(data.matrix(beverage_test))
pls_per<-postResample(pred =plspredict, obs = PH_test) set.seed(12345)
knn <- train(x = data.matrix(beverage_train), y = PH_train, method = "knn", preProc = c("center", "scale"), tuneLength = 10)
knn## k-Nearest Neighbors
##
## 2056 samples
## 22 predictor
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2056, 2056, 2056, 2056, 2056, 2056, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1427398 0.3353862 0.1061367
## 7 0.1389803 0.3527375 0.1044466
## 9 0.1375999 0.3588496 0.1041113
## 11 0.1367679 0.3630672 0.1038095
## 13 0.1363377 0.3649570 0.1037478
## 15 0.1362424 0.3649974 0.1038454
## 17 0.1361829 0.3649795 0.1041369
## 19 0.1363455 0.3632924 0.1044109
## 21 0.1366181 0.3607244 0.1047539
## 23 0.1367730 0.3590180 0.1050115
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- knn|>predict(data.matrix(beverage_test))
knn_per<-postResample(pred = knnPred, obs = PH_test) Decay = .04, size = 3.
set.seed(12345)
library(nnet)
nn <- nnet(data.matrix(beverage_train), PH_train,
size = 3,
decay = 0.04,
linout = TRUE,
trace = FALSE,
maxit = 5000,
MaxNWts = 5 * (ncol(beverage_train) + 1) +5 +1)
nn## a 22-3-1 network with 73 weights
## options were - linear output units decay=0.04
nnPred <- nn|>predict(data.matrix(beverage_test))
nn_per<-postResample(pred = nnPred, obs = PH_test) set.seed(12345)
library(earth)## 载入需要的程辑包:Formula
## 载入需要的程辑包:plotmo
## 载入需要的程辑包:plotrix
## 载入需要的程辑包:TeachingDemos
mars <- earth(data.matrix(beverage_train), PH_train)
mars## Selected 18 of 22 terms, and 10 of 22 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: Mnf.Flow, Brand.Code, Usage.cont, Pressure.Vacuum, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.01663529 RSS 33.04759 GRSq 0.4283813 RSq 0.4471397
marsPred <- mars|>predict(data.matrix(beverage_test))
mars_per<-postResample(pred = marsPred, obs = PH_test) set.seed(12345)
library(caret)
svm_Radial <- train(data.matrix(beverage_train), PH_train, method = "svmRadial", preProc = c("center","scale"), trControl = trainControl(method = "cv"))
svm_Radial## Support Vector Machines with Radial Basis Function Kernel
##
## 2056 samples
## 22 predictor
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1851, 1850, 1851, 1850, 1850, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1297488 0.4312326 0.09728400
## 0.50 0.1269181 0.4533325 0.09430035
## 1.00 0.1254221 0.4643878 0.09267249
##
## Tuning parameter 'sigma' was held constant at a value of 0.03242085
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.03242085 and C = 1.
svm_RadialPred <- svm_Radial|>predict(data.matrix(beverage_test))
svm_Radial_per<-postResample(pred = svm_RadialPred, obs = PH_test)
#SVM Linear
svm_Linear <- train(data.matrix(beverage_train), PH_train, method = "svmLinear", preProc = c("center","scale"), trControl = trainControl(method = "cv"))
svm_Linear## Support Vector Machines with Linear Kernel
##
## 2056 samples
## 22 predictor
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1850, 1851, 1851, 1851, 1851, 1851, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1418761 0.3166113 0.1079261
##
## Tuning parameter 'C' was held constant at a value of 1
svm_LinearPred <- svm_Linear|>predict(data.matrix(beverage_test))
svm_Linear_per<-postResample(pred = svm_LinearPred, obs = PH_test) set.seed(12345)
library(randomForest)## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## 载入程辑包:'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
PH_train2<-cbind(beverage_train, PH_train)
rf <- randomForest(PH_train ~., data = PH_train2, importance = TRUE, ntree = 1000)
rf##
## Call:
## randomForest(formula = PH_train ~ ., data = PH_train2, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 0.0109958
## % Var explained: 62.18
rfPred <- rf|>predict(beverage_test)
rf_per<-postResample(pred = rfPred, obs = PH_test) Depth = 4, ntrees = 10000, shrinkage = .007
set.seed(12345)
library(gbm)## Loaded gbm 2.1.8.1
boosted <- gbm(
formula = PH_train ~., data = PH_train2[,-1],
distribution = "gaussian",
n.trees = 10000,
interaction.depth = 4,
shrinkage = 0.007,
cv.folds = 10,
n.cores = NULL,
verbose = FALSE
)
boosted## gbm(formula = PH_train ~ ., distribution = "gaussian", data = PH_train2[,
## -1], n.trees = 10000, interaction.depth = 4, shrinkage = 0.007,
## cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 10000 iterations were performed.
## The best cross-validation iteration was 10000.
## There were 21 predictors of which 21 had non-zero influence.
print(paste("Number of trees:", boosted$n.trees))## [1] "Number of trees: 10000"
boostPred <- predict(boosted, n.trees = boosted$n.trees, newdata = as.data.frame(beverage_test))
boost_per<-postResample(pred = boostPred, obs = PH_test) set.seed(12345)
library(Cubist)
cubist <- cubist(beverage_train, PH_train)
cubist##
## Call:
## cubist.default(x = beverage_train, y = PH_train)
##
## Number of samples: 2056
## Number of predictors: 22
##
## Number of committees: 1
## Number of rules: 24
cubistPred <- predict(cubist, newdata = beverage_test)
cubist_per<-postResample(pred = cubistPred, obs = PH_test) #Create a summary for all models:
rbind(OLS_per, elastic_per, pls_per, knn_per, nn_per, mars_per, svm_Radial_per, svm_Linear_per, rf_per, boost_per, cubist_per)## RMSE Rsquared MAE
## OLS_per 0.13484747 0.4184747 0.10496797
## elastic_per 0.14367026 0.3543933 0.11135393
## pls_per 0.14209470 0.3544839 0.10793878
## knn_per 0.12831982 0.4897171 0.09804966
## nn_per 0.14031098 0.3694024 0.10466801
## mars_per 0.13036106 0.4563444 0.09816436
## svm_Radial_per 0.12579316 0.5008554 0.09117058
## svm_Linear_per 0.14395534 0.3461752 0.10781250
## rf_per 0.09994053 0.7113584 0.07322895
## boost_per 0.10775088 0.6304643 0.08003457
## cubist_per 0.12128567 0.5402787 0.08648989
**It looks like Random Forest and Boosted are the best models from both RMSE and Rsquared.
lmod_Imp<-varImp(lmod_step)
rf_Imp<-varImp(rf)
summary.gbm(boosted)We tested 11 machine learning on beverage data in order to predict PH. OUr final model is tree-based (Random Forest). The Model R-Squared is 69%, which is not a good model. More work is required to be done for predictors to have a better model.