This is a role-playing exercise. I am your new boss. I am in charge of production at ABC Beverage, and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should clearly show the models you tested and how you selected your final approach.
Please submit both RPubs links and .Rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' predictions of pH.
library(tidyverse)
library(fpp3)
library(caret)
library(RANN)
library(psych)
library(DataExplorer)
library(randomForest)
library(Cubist)
library(rpart)
library(writexl)
library(openxlsx)
library(corrplot)
library(mice)
library(earth)
Historical data was provided in an Excel workbook. The data is read into a data frame and evaluated.
train.df <- read.xlsx('StudentData.xlsx', sheet = 1)
eval.df <- read.xlsx('StudentEvaluation.xlsx', sheet = 1)
glimpse(train.df)
## Rows: 2,571
## Columns: 33
## $ Brand.Code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ Carb.Volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ Fill.Ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ PC.Volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ Carb.Pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ Carb.Temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ PSC <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ PSC.Fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ PSC.CO2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ Mnf.Flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ Carb.Pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ Fill.Pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ Hyd.Pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Hyd.Pressure4 <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ Filler.Level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ Filler.Speed <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ Temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ Usage.cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ Carb.Flow <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ Density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ MFR <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ Balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ Pressure.Vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ PH <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ Oxygen.Filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ Bowl.Setpoint <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ Pressure.Setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ Air.Pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ Alch.Rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ Carb.Rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ Balling.Lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
Brand.Code is the only categorical predictor in the data
set and needs to be converted to a factor.
train.df <- train.df |>
  mutate(Brand.Code = as.factor(Brand.Code))
eval.df <- eval.df |>
  mutate(Brand.Code = as.factor(Brand.Code))
train.df |>
  ggplot() +
  geom_bar(aes(x = Brand.Code)) +
  labs(title = 'Distribution of Brand.Code')
PH is the response variable; the remaining variables are predictors for that response.
train.df |>
  keep(is.numeric) |>
  summary()
## Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917 1st Qu.:65.60
## Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## Mean :5.370 Mean :23.97 Mean :0.27712 Mean :68.19
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## NA's :10 NA's :38 NA's :39 NA's :27
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08457 Mean :0.1954 Mean :0.05641
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## NA's :26 NA's :33 NA's :23 NA's :39
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure1
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. :-0.80
## 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00 1st Qu.: 0.00
## Median : 65.20 Median :123.2 Median :46.40 Median :11.40
## Mean : 24.57 Mean :122.6 Mean :47.92 Mean :12.44
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:20.20
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :58.00
## NA's :2 NA's :32 NA's :22 NA's :11
## Hyd.Pressure2 Hyd.Pressure3 Hyd.Pressure4 Filler.Level
## Min. : 0.00 Min. :-1.20 Min. : 52.00 Min. : 55.8
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 98.3
## Median :28.60 Median :27.60 Median : 96.00 Median :118.4
## Mean :20.96 Mean :20.46 Mean : 96.29 Mean :109.3
## 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00 3rd Qu.:120.0
## Max. :59.40 Max. :50.00 Max. :142.00 Max. :161.2
## NA's :15 NA's :15 NA's :30 NA's :20
## Filler.Speed Temperature Usage.cont Carb.Flow Density
## Min. : 998 Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240
## 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144 1st Qu.:0.900
## Median :3982 Median :65.60 Median :21.79 Median :3028 Median :0.980
## Mean :3687 Mean :65.97 Mean :20.99 Mean :2468 Mean :1.174
## 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186 3rd Qu.:1.620
## Max. :4030 Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920
## NA's :57 NA's :14 NA's :5 NA's :2 NA's :1
## MFR Balling Pressure.Vacuum PH
## Min. : 31.4 Min. :-0.170 Min. :-6.600 Min. :7.880
## 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440
## Median :724.0 Median : 1.648 Median :-5.400 Median :8.540
## Mean :704.0 Mean : 2.198 Mean :-5.216 Mean :8.546
## 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680
## Max. :868.6 Max. : 4.012 Max. :-3.600 Max. :9.360
## NA's :212 NA's :1 NA's :4
## Oxygen.Filler Bowl.Setpoint Pressure.Setpoint Air.Pressurer
## Min. :0.00240 Min. : 70.0 Min. :44.00 Min. :140.8
## 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2
## Median :0.03340 Median :120.0 Median :46.00 Median :142.6
## Mean :0.04684 Mean :109.3 Mean :47.62 Mean :142.8
## 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0
## Max. :0.40000 Max. :140.0 Max. :52.00 Max. :148.2
## NA's :12 NA's :2 NA's :12
## Alch.Rel Carb.Rel Balling.Lvl
## Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :6.560 Median :5.400 Median :1.48
## Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
We construct a correlation plot to determine which predictors are related to PH, and whether any predictors are highly correlated with one another.
train.df |>
  keep(is.numeric) |>
  na.omit() |>
  cor() |>
  corrplot(tl.col = 'black', tl.cex = .6)
Looking at the chart, several variables have a very low correlation with PH; for example, Carb.Temp, PSC, PSC.Fill, and PSC.CO2 appear to have little relationship to PH.
We also look for variables with near-zero variance, which carry little information and are therefore unlikely to help determine the PH of the beverage.
train.df |>
  nearZeroVar(saveMetrics = TRUE) |>
  filter(nzv == TRUE)
## freqRatio percentUnique zeroVar nzv
## Hyd.Pressure1 31.11111 9.529366 FALSE TRUE
The predictor Hyd.Pressure1 has near-zero variance and
will be removed from the model.
Many of the predictors have missing values, which need to be imputed to build a consistent model.
plot_missing(train.df)
We use the mice package with predictive mean matching (PMM) to impute the missing values, and then remove Hyd.Pressure1, which has near-zero variance.
impute.df <- train.df |>
  mice(m = 1, method = 'pmm', print = FALSE) |>
  complete()
impute.df <- impute.df[, -nearZeroVar(impute.df)]
summary(impute.df)
## Brand.Code Carb.Volume Fill.Ounces PC.Volume Carb.Pressure
## A: 313 Min. :5.040 Min. :23.63 Min. :0.07933 Min. :57.00
## B:1315 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23933 1st Qu.:65.60
## C: 324 Median :5.347 Median :23.97 Median :0.27133 Median :68.20
## D: 619 Mean :5.370 Mean :23.97 Mean :0.27789 Mean :68.21
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31267 3rd Qu.:70.60
## Max. :5.700 Max. :24.32 Max. :0.47800 Max. :79.40
## Carb.Temp PSC PSC.Fill PSC.CO2
## Min. :128.6 Min. :0.00200 Min. :0.0000 Min. :0.00000
## 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000 1st Qu.:0.02000
## Median :140.8 Median :0.07600 Median :0.1800 Median :0.04000
## Mean :141.1 Mean :0.08461 Mean :0.1962 Mean :0.05662
## 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600 3rd Qu.:0.08000
## Max. :154.0 Max. :0.27000 Max. :0.6200 Max. :0.24000
## Mnf.Flow Carb.Pressure1 Fill.Pressure Hyd.Pressure2
## Min. :-100.20 Min. :105.6 Min. :34.60 Min. : 0.00
## 1st Qu.:-100.00 1st Qu.:118.8 1st Qu.:46.00 1st Qu.: 0.00
## Median : 64.80 Median :123.2 Median :46.40 Median :28.60
## Mean : 24.47 Mean :122.6 Mean :47.92 Mean :20.97
## 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00 3rd Qu.:34.60
## Max. : 229.40 Max. :140.2 Max. :60.40 Max. :59.40
## Hyd.Pressure3 Hyd.Pressure4 Filler.Level Filler.Speed
## Min. :-1.20 Min. : 52.00 Min. : 55.8 Min. : 998
## 1st Qu.: 0.00 1st Qu.: 86.00 1st Qu.: 97.7 1st Qu.:3819
## Median :27.40 Median : 96.00 Median :118.4 Median :3980
## Mean :20.44 Mean : 96.55 Mean :109.2 Mean :3638
## 3rd Qu.:33.30 3rd Qu.:102.00 3rd Qu.:120.0 3rd Qu.:3996
## Max. :50.00 Max. :142.00 Max. :161.2 Max. :4030
## Temperature Usage.cont Carb.Flow Density MFR
## Min. :63.60 Min. :12.08 Min. : 26 Min. :0.240 Min. : 31.4
## 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1151 1st Qu.:0.900 1st Qu.:695.0
## Median :65.60 Median :21.80 Median :3028 Median :0.980 Median :721.4
## Mean :65.98 Mean :20.99 Mean :2469 Mean :1.174 Mean :668.9
## 3rd Qu.:66.40 3rd Qu.:23.76 3rd Qu.:3187 3rd Qu.:1.620 3rd Qu.:730.4
## Max. :76.20 Max. :25.90 Max. :5104 Max. :1.920 Max. :868.6
## Balling Pressure.Vacuum PH Oxygen.Filler
## Min. :-0.170 Min. :-6.600 Min. :7.880 Min. :0.00240
## 1st Qu.: 1.496 1st Qu.:-5.600 1st Qu.:8.440 1st Qu.:0.02200
## Median : 1.648 Median :-5.400 Median :8.540 Median :0.03340
## Mean : 2.197 Mean :-5.216 Mean :8.546 Mean :0.04699
## 3rd Qu.: 3.292 3rd Qu.:-5.000 3rd Qu.:8.680 3rd Qu.:0.06000
## Max. : 4.012 Max. :-3.600 Max. :9.360 Max. :0.40000
## Bowl.Setpoint Pressure.Setpoint Air.Pressurer Alch.Rel
## Min. : 70.0 Min. :44.00 Min. :140.8 Min. :5.280
## 1st Qu.:100.0 1st Qu.:46.00 1st Qu.:142.2 1st Qu.:6.540
## Median :120.0 Median :46.00 Median :142.6 Median :6.560
## Mean :109.3 Mean :47.61 Mean :142.8 Mean :6.897
## 3rd Qu.:120.0 3rd Qu.:50.00 3rd Qu.:143.0 3rd Qu.:7.230
## Max. :140.0 Max. :52.00 Max. :148.2 Max. :8.620
## Carb.Rel Balling.Lvl
## Min. :4.960 Min. :0.00
## 1st Qu.:5.340 1st Qu.:1.38
## Median :5.400 Median :1.48
## Mean :5.436 Mean :2.05
## 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :6.060 Max. :3.66
We will create an 80/20 split of the data so that we can evaluate the effectiveness of each model.
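The chunk that created the split is not echoed in this report. Below is a minimal sketch, consistent with the 2,058-sample training set shown in the model output, of how train.x, train.y, test.x, and test.y could have been constructed; this is an assumed reconstruction, not the original code.
set.seed(200)
# Assumed reconstruction of the 80/20 split: createDataPartition
# stratifies on the response (PH) and returns the training row indices.
train.idx <- createDataPartition(impute.df$PH, p = 0.8, list = FALSE)
train.x <- impute.df[train.idx, ] |> select(-PH)
train.y <- impute.df$PH[train.idx]
test.x  <- impute.df[-train.idx, ] |> select(-PH)
test.y  <- impute.df$PH[-train.idx]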
Construct a Linear Regression Model with the training set and evaluate against the test set.
set.seed(200)
lm.model <- train(train.x, train.y,
                  method = "lm",
                  trControl = trainControl(method = "cv", number = 10))
lm.model
## Linear Regression
##
## 2058 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1330872 0.4011177 0.1040745
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
lm.pred <- predict(lm.model, newdata = test.x)
postResample(pred = lm.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.1396969 0.3764328 0.1043370
Construct a Partial Least Squares Model with the training set and evaluate against the test set.
set.seed(200)
pls.model <- train(train.x, train.y,
                   method = 'pls',
                   tuneLength = 20,
                   trControl = trainControl(method = 'cv', number = 10),
                   preProc = c('center', 'scale'))
pls.model
## Partial Least Squares
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1525958 0.2112304 0.1218642
## 2 0.1450553 0.2864518 0.1140696
## 3 0.1414494 0.3225656 0.1113618
## 4 0.1394063 0.3421337 0.1097266
## 5 0.1367553 0.3668592 0.1071295
## 6 0.1351591 0.3816178 0.1065537
## 7 0.1344708 0.3882548 0.1056289
## 8 0.1341161 0.3914107 0.1054017
## 9 0.1338555 0.3938736 0.1048611
## 10 0.1336656 0.3956137 0.1046495
## 11 0.1335530 0.3964762 0.1046391
## 12 0.1334661 0.3971635 0.1044665
## 13 0.1333100 0.3984365 0.1044862
## 14 0.1332473 0.3991241 0.1044555
## 15 0.1332402 0.3992458 0.1043565
## 16 0.1332062 0.3995973 0.1044712
## 17 0.1331298 0.4003092 0.1043198
## 18 0.1331790 0.3999197 0.1042926
## 19 0.1330622 0.4009587 0.1042041
## 20 0.1330692 0.4009621 0.1042114
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 19.
plot(pls.model)
pls.pred <- predict(pls.model, newdata = test.x)
postResample(pred = pls.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.1398721 0.3749599 0.1047261
Construct a Multivariate Adaptive Regression Spline Model with the training set and evaluate against the test set.
set.seed(200)
mars.model <- train(x = train.x,
                    y = train.y,
                    method = 'earth',
                    tuneGrid = expand.grid(degree = 1:2, nprune = 2:15),
                    preProcess = c('center', 'scale'),
                    trControl = trainControl(method = 'cv'))
mars.model
## Multivariate Adaptive Regression Spline
##
## 2058 samples
## 31 predictor
##
## Pre-processing: centered (30), scaled (30), ignore (1)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1519593 0.2173347 0.11981987
## 1 3 0.1436064 0.3027557 0.11341513
## 1 4 0.1424996 0.3134170 0.11209297
## 1 5 0.1412752 0.3250453 0.11126242
## 1 6 0.1390703 0.3454738 0.10888656
## 1 7 0.1373233 0.3618968 0.10741878
## 1 8 0.1358347 0.3754513 0.10614550
## 1 9 0.1348121 0.3855048 0.10475480
## 1 10 0.1332075 0.3998987 0.10384130
## 1 11 0.1322520 0.4081996 0.10266059
## 1 12 0.1318186 0.4122003 0.10224431
## 1 13 0.1300985 0.4274200 0.10114811
## 1 14 0.1298456 0.4295064 0.10086508
## 1 15 0.1292671 0.4343528 0.10048585
## 2 2 0.1519593 0.2173347 0.11981987
## 2 3 0.1444026 0.2955454 0.11423636
## 2 4 0.1423315 0.3147275 0.11232971
## 2 5 0.1400932 0.3363659 0.11049135
## 2 6 0.1388901 0.3474765 0.10883481
## 2 7 0.1377243 0.3596276 0.10770676
## 2 8 0.1361951 0.3737595 0.10571512
## 2 9 0.1351186 0.3846038 0.10471649
## 2 10 0.1330222 0.4038920 0.10284740
## 2 11 0.1310934 0.4204518 0.10163345
## 2 12 0.1294455 0.4342111 0.09953203
## 2 13 0.1283521 0.4441439 0.09859935
## 2 14 0.1270568 0.4550686 0.09770187
## 2 15 0.1263514 0.4610164 0.09681491
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
mars.pred <- predict(mars.model, newdata = test.x)
postResample(pred = mars.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.1366891 0.4027412 0.1001467
Construct a Boosted Tree Model with the training set and evaluate against the test set.
set.seed(200)
gbm.model <- train(train.x, train.y,
                   method = "gbm",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = expand.grid(interaction.depth = seq(1, 7, by = 2),
                                          n.trees = seq(100, 1000, by = 50),
                                          shrinkage = c(0.01, 0.1),
                                          n.minobsinnode = 10),
                   verbose = FALSE)
gbm.model
## Stochastic Gradient Boosting
##
## 2058 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 0.1523530 0.3119592 0.12203182
## 0.01 1 150 0.1474505 0.3415488 0.11788436
## 0.01 1 200 0.1438972 0.3605049 0.11490128
## 0.01 1 250 0.1411843 0.3738853 0.11272101
## 0.01 1 300 0.1390103 0.3840941 0.11094965
## 0.01 1 350 0.1374860 0.3897144 0.10971008
## 0.01 1 400 0.1363454 0.3942788 0.10877506
## 0.01 1 450 0.1354198 0.3990871 0.10801896
## 0.01 1 500 0.1347201 0.4025222 0.10745010
## 0.01 1 550 0.1340625 0.4070181 0.10691967
## 0.01 1 600 0.1334963 0.4107255 0.10646798
## 0.01 1 650 0.1329929 0.4143471 0.10604720
## 0.01 1 700 0.1325657 0.4174464 0.10565816
## 0.01 1 750 0.1321634 0.4202061 0.10530869
## 0.01 1 800 0.1317646 0.4230349 0.10498552
## 0.01 1 850 0.1314161 0.4254842 0.10467934
## 0.01 1 900 0.1311301 0.4274881 0.10442771
## 0.01 1 950 0.1308327 0.4298861 0.10417909
## 0.01 1 1000 0.1305332 0.4320379 0.10389071
## 0.01 3 100 0.1415708 0.4190988 0.11348201
## 0.01 3 150 0.1352931 0.4413851 0.10819917
## 0.01 3 200 0.1312982 0.4567712 0.10479737
## 0.01 3 250 0.1287164 0.4690652 0.10247155
## 0.01 3 300 0.1268598 0.4790437 0.10081947
## 0.01 3 350 0.1253208 0.4883158 0.09945864
## 0.01 3 400 0.1241145 0.4956628 0.09833239
## 0.01 3 450 0.1232290 0.5011062 0.09746881
## 0.01 3 500 0.1223970 0.5066522 0.09661292
## 0.01 3 550 0.1216553 0.5113117 0.09586007
## 0.01 3 600 0.1209650 0.5159266 0.09513149
## 0.01 3 650 0.1204203 0.5193619 0.09455273
## 0.01 3 700 0.1198866 0.5229287 0.09396714
## 0.01 3 750 0.1194500 0.5258525 0.09347293
## 0.01 3 800 0.1190126 0.5285590 0.09304575
## 0.01 3 850 0.1185584 0.5317001 0.09256604
## 0.01 3 900 0.1181479 0.5345187 0.09214940
## 0.01 3 950 0.1178040 0.5366624 0.09177567
## 0.01 3 1000 0.1174374 0.5391947 0.09139282
## 0.01 5 100 0.1370860 0.4676510 0.10932691
## 0.01 5 150 0.1302670 0.4879111 0.10339200
## 0.01 5 200 0.1259103 0.5044192 0.09964645
## 0.01 5 250 0.1230882 0.5170579 0.09712308
## 0.01 5 300 0.1211207 0.5267196 0.09530564
## 0.01 5 350 0.1195372 0.5358168 0.09382259
## 0.01 5 400 0.1182392 0.5434687 0.09258626
## 0.01 5 450 0.1172179 0.5494681 0.09151781
## 0.01 5 500 0.1164673 0.5533239 0.09076463
## 0.01 5 550 0.1157392 0.5576540 0.09005657
## 0.01 5 600 0.1151108 0.5614288 0.08948031
## 0.01 5 650 0.1145076 0.5652082 0.08885590
## 0.01 5 700 0.1140258 0.5680074 0.08835541
## 0.01 5 750 0.1135324 0.5711834 0.08787691
## 0.01 5 800 0.1130509 0.5743438 0.08738190
## 0.01 5 850 0.1125362 0.5776265 0.08688094
## 0.01 5 900 0.1121588 0.5800251 0.08656658
## 0.01 5 950 0.1117764 0.5823843 0.08615613
## 0.01 5 1000 0.1115184 0.5838670 0.08582951
## 0.01 7 100 0.1342139 0.4982731 0.10684194
## 0.01 7 150 0.1270261 0.5167426 0.10057407
## 0.01 7 200 0.1224944 0.5327363 0.09654200
## 0.01 7 250 0.1194406 0.5461478 0.09381487
## 0.01 7 300 0.1174010 0.5556221 0.09186063
## 0.01 7 350 0.1157821 0.5640629 0.09032519
## 0.01 7 400 0.1145704 0.5706573 0.08914123
## 0.01 7 450 0.1135441 0.5765679 0.08812839
## 0.01 7 500 0.1126344 0.5819316 0.08727011
## 0.01 7 550 0.1119832 0.5855958 0.08658761
## 0.01 7 600 0.1113210 0.5891498 0.08595459
## 0.01 7 650 0.1107006 0.5928040 0.08534412
## 0.01 7 700 0.1102049 0.5957555 0.08482307
## 0.01 7 750 0.1097457 0.5984771 0.08436663
## 0.01 7 800 0.1092611 0.6013617 0.08391944
## 0.01 7 850 0.1088465 0.6040215 0.08344786
## 0.01 7 900 0.1085133 0.6060229 0.08307932
## 0.01 7 950 0.1081617 0.6082807 0.08275758
## 0.01 7 1000 0.1078023 0.6105489 0.08238882
## 0.10 1 100 0.1305678 0.4312207 0.10395219
## 0.10 1 150 0.1282054 0.4477566 0.10143194
## 0.10 1 200 0.1274292 0.4523396 0.10037926
## 0.10 1 250 0.1267606 0.4565717 0.09936376
## 0.10 1 300 0.1264646 0.4584828 0.09888923
## 0.10 1 350 0.1261117 0.4606947 0.09843724
## 0.10 1 400 0.1259148 0.4620522 0.09786859
## 0.10 1 450 0.1256598 0.4642112 0.09759580
## 0.10 1 500 0.1256141 0.4645380 0.09753160
## 0.10 1 550 0.1254855 0.4654027 0.09740397
## 0.10 1 600 0.1253991 0.4661349 0.09733180
## 0.10 1 650 0.1254600 0.4657393 0.09728092
## 0.10 1 700 0.1255502 0.4650022 0.09740311
## 0.10 1 750 0.1255903 0.4646530 0.09730425
## 0.10 1 800 0.1256046 0.4647453 0.09729416
## 0.10 1 850 0.1256222 0.4649123 0.09723545
## 0.10 1 900 0.1255960 0.4651797 0.09715294
## 0.10 1 950 0.1254433 0.4663054 0.09695375
## 0.10 1 1000 0.1253694 0.4669895 0.09700719
## 0.10 3 100 0.1173833 0.5370013 0.09132690
## 0.10 3 150 0.1157319 0.5474978 0.08937931
## 0.10 3 200 0.1143793 0.5568460 0.08791406
## 0.10 3 250 0.1135167 0.5630276 0.08693628
## 0.10 3 300 0.1127019 0.5690655 0.08618343
## 0.10 3 350 0.1121714 0.5730238 0.08586251
## 0.10 3 400 0.1120209 0.5743261 0.08561256
## 0.10 3 450 0.1116947 0.5763781 0.08509327
## 0.10 3 500 0.1114750 0.5782678 0.08495813
## 0.10 3 550 0.1110443 0.5813300 0.08460982
## 0.10 3 600 0.1108103 0.5832841 0.08445083
## 0.10 3 650 0.1110192 0.5816803 0.08461997
## 0.10 3 700 0.1106782 0.5844288 0.08421633
## 0.10 3 750 0.1106600 0.5847471 0.08392635
## 0.10 3 800 0.1102981 0.5874271 0.08358527
## 0.10 3 850 0.1102917 0.5876629 0.08353160
## 0.10 3 900 0.1100556 0.5897448 0.08343610
## 0.10 3 950 0.1098820 0.5910939 0.08341153
## 0.10 3 1000 0.1097981 0.5918695 0.08328473
## 0.10 5 100 0.1134493 0.5661261 0.08746607
## 0.10 5 150 0.1113925 0.5802328 0.08472383
## 0.10 5 200 0.1106609 0.5844985 0.08422365
## 0.10 5 250 0.1095775 0.5924311 0.08344208
## 0.10 5 300 0.1087468 0.5983458 0.08278681
## 0.10 5 350 0.1084243 0.6008193 0.08239929
## 0.10 5 400 0.1081081 0.6032341 0.08213135
## 0.10 5 450 0.1081318 0.6029802 0.08203610
## 0.10 5 500 0.1081481 0.6030625 0.08196245
## 0.10 5 550 0.1079318 0.6044821 0.08174909
## 0.10 5 600 0.1076626 0.6066524 0.08142384
## 0.10 5 650 0.1074597 0.6081775 0.08129023
## 0.10 5 700 0.1072053 0.6101711 0.08114198
## 0.10 5 750 0.1072177 0.6102686 0.08110459
## 0.10 5 800 0.1072049 0.6105270 0.08100676
## 0.10 5 850 0.1071836 0.6107098 0.08100749
## 0.10 5 900 0.1072853 0.6101251 0.08110176
## 0.10 5 950 0.1072663 0.6103520 0.08114777
## 0.10 5 1000 0.1073540 0.6095887 0.08129552
## 0.10 7 100 0.1100874 0.5904037 0.08385399
## 0.10 7 150 0.1078922 0.6060368 0.08170157
## 0.10 7 200 0.1068334 0.6129153 0.08063340
## 0.10 7 250 0.1065951 0.6142656 0.08016653
## 0.10 7 300 0.1065075 0.6148422 0.07981809
## 0.10 7 350 0.1062637 0.6167367 0.07974546
## 0.10 7 400 0.1065941 0.6143933 0.07994070
## 0.10 7 450 0.1063581 0.6161393 0.07975480
## 0.10 7 500 0.1063456 0.6161974 0.07968278
## 0.10 7 550 0.1060731 0.6183322 0.07940382
## 0.10 7 600 0.1062205 0.6172598 0.07944048
## 0.10 7 650 0.1062470 0.6171154 0.07947978
## 0.10 7 700 0.1062981 0.6167126 0.07952513
## 0.10 7 750 0.1064330 0.6158421 0.07963543
## 0.10 7 800 0.1063724 0.6163675 0.07953334
## 0.10 7 850 0.1062828 0.6171753 0.07945210
## 0.10 7 900 0.1061756 0.6179672 0.07941516
## 0.10 7 950 0.1061899 0.6180009 0.07940876
## 0.10 7 1000 0.1061546 0.6183485 0.07939927
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 550, interaction.depth =
## 7, shrinkage = 0.1 and n.minobsinnode = 10.
gbm.pred <- predict(gbm.model, newdata = test.x)
postResample(pred = gbm.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.11542839 0.57373535 0.08164853
Construct a Random Forest Model with the training set and evaluate against the test set.
set.seed(200)
rf.model <- randomForest(train.x, train.y,
                         importance = TRUE,
                         ntree = 2000)
rf.model
##
## Call:
## randomForest(x = train.x, y = train.y, ntree = 2000, importance = TRUE)
## Type of random forest: regression
## Number of trees: 2000
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 0.009253301
## % Var explained: 68.52
rf.pred <- predict(rf.model, newdata = test.x)
postResample(pred = rf.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.10971015 0.63248324 0.07551098
Construct a Classification and Regression Tree Model with the training set and evaluate against the test set.
set.seed(200)
cart.model <- train(train.x, train.y,
                    method = "rpart",
                    tuneLength = 10,
                    trControl = trainControl(method = "cv"))
cart.model
## CART
##
## 2058 samples
## 31 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01185559 0.1296635 0.4308978 0.1010382
## 0.01339721 0.1301769 0.4256272 0.1014535
## 0.01441989 0.1322061 0.4074353 0.1036957
## 0.01445036 0.1322061 0.4074353 0.1036957
## 0.01823303 0.1351298 0.3810924 0.1059655
## 0.02174111 0.1372140 0.3621558 0.1084246
## 0.03349671 0.1401799 0.3338156 0.1109589
## 0.04196751 0.1435611 0.3018764 0.1144110
## 0.07211729 0.1495527 0.2418983 0.1176743
## 0.21348839 0.1672073 0.1635403 0.1340677
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01185559.
cart.pred <- predict(cart.model, newdata = test.x)
postResample(pred = cart.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.1376576 0.3937274 0.1040229
Construct an SVM Model with the training set and evaluate against the
test set. This model requires that the categorical predictor,
Brand.Code, be removed.
set.seed(200)
svm.model <- train(train.x[,-1], train.y,
                   method = 'svmRadial',
                   preProcess = c('center', 'scale'),
                   tuneLength = 10,
                   trControl = trainControl(method = 'cv'))
svm.model
## Support Vector Machines with Radial Basis Function Kernel
##
## 2058 samples
## 30 predictor
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1853, 1852, 1851, 1852, 1852, 1852, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1317637 0.4222040 0.09825709
## 0.50 0.1288566 0.4449897 0.09477889
## 1.00 0.1266608 0.4624419 0.09214453
## 2.00 0.1249498 0.4757384 0.09054553
## 4.00 0.1236949 0.4864580 0.08973337
## 8.00 0.1233301 0.4929074 0.08985149
## 16.00 0.1253959 0.4844122 0.09207591
## 32.00 0.1286991 0.4712901 0.09485684
## 64.00 0.1322165 0.4589403 0.09822307
## 128.00 0.1373938 0.4406352 0.10247597
##
## Tuning parameter 'sigma' was held constant at a value of 0.02503082
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02503082 and C = 8.
svm.pred <- predict(svm.model, newdata = test.x[,-1])
postResample(pred = svm.pred, obs = test.y)
## RMSE Rsquared MAE
## 0.12551989 0.49840999 0.08851726
results <- bind_rows(
  data.frame(as.list(postResample(pred = lm.pred,   obs = test.y))) |> mutate(model = 'LM'),
  data.frame(as.list(postResample(pred = pls.pred,  obs = test.y))) |> mutate(model = 'PLS'),
  data.frame(as.list(postResample(pred = mars.pred, obs = test.y))) |> mutate(model = 'MARS'),
  data.frame(as.list(postResample(pred = gbm.pred,  obs = test.y))) |> mutate(model = 'TREE'),
  data.frame(as.list(postResample(pred = rf.pred,   obs = test.y))) |> mutate(model = 'RF'),
  data.frame(as.list(postResample(pred = cart.pred, obs = test.y))) |> mutate(model = 'CART'),
  data.frame(as.list(postResample(pred = svm.pred,  obs = test.y))) |> mutate(model = 'SVM')
) |>
  relocate(model, RMSE, Rsquared, MAE)
results |>
  arrange(desc(Rsquared))
## model RMSE Rsquared MAE
## 1 RF 0.1097102 0.6324832 0.07551098
## 2 TREE 0.1154284 0.5737354 0.08164853
## 3 SVM 0.1255199 0.4984100 0.08851726
## 4 MARS 0.1366891 0.4027412 0.10014672
## 5 CART 0.1376576 0.3937274 0.10402294
## 6 LM 0.1396969 0.3764328 0.10433701
## 7 PLS 0.1398721 0.3749599 0.10472612
The Random Forest model has the highest \(R^2\) value (and the lowest RMSE) on the test set and is chosen as the final model for PH.
Looking at the top 10 predictors, we see that Brand.Code and Mnf.Flow are the two most important predictors for determining PH.
top10 <- varImp(rf.model) |>
  arrange(desc(Overall)) |>
  head(10)
varImpPlot(rf.model, n.var = 10)
Now that we know which predictors are most important for determining
the PH of the beverage, we can look at the correlation
matrix to see how they relate.
impute.df |>
  select(c('PH', row.names(top10))) |>
  keep(is.numeric) |>
  cor() |>
  corrplot()
We impute missing values in the evaluation data set using the same method as for the training set. Again we remove Hyd.Pressure1 because it has near-zero variance.
set.seed(200)
eval.impute <- eval.df |>
  select(-PH) |>
  mice(m = 1, method = 'pmm', print = FALSE) |>
  complete() |>
  select(-Hyd.Pressure1)
Using the Random Forest Model, we will predict the PH in
the evaluation data set.
eval.pred <- predict(rf.model, newdata = eval.impute)
head(eval.pred, 10)
## 1 2 3 4 5 6 7 8
## 8.571348 8.461156 8.548605 8.584590 8.517460 8.535009 8.490136 8.578362
## 9 10
## 8.546094 8.629645
Insert the computed PH into the original evaluation data
set and export to Excel.
pred.df <- eval.df |>
  mutate(PH = eval.pred)
pred.df |>
  write.xlsx('StudentEvaluationPreds.xlsx')
After evaluating each model, the Random Forest model was determined to be the most effective at predicting PH. The predicted values fall within the same range as the historical training data, which suggests the model fits the process well and can reasonably be used to score the evaluation set. For future work, it may be worth investigating neural networks to see whether they yield better results.
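As a starting point for that follow-up, here is a hedged sketch (not run as part of this report) of an averaged neural network fit with caret's 'avNNet' method on the same train/test split, with Brand.Code dropped as in the SVM model; the size, decay, and maxit values are illustrative assumptions, not tuned choices.
set.seed(200)
# Sketch only: averaged neural network on the numeric predictors
# (requires the nnet package; tuning grid values are illustrative)
nnet.model <- train(train.x[,-1], train.y,
                    method = 'avNNet',
                    preProcess = c('center', 'scale'),
                    tuneGrid = expand.grid(size = 1:5,
                                           decay = c(0, 0.01, 0.1),
                                           bag = FALSE),
                    trControl = trainControl(method = 'cv', number = 10),
                    linout = TRUE, trace = FALSE, maxit = 500)
postResample(pred = predict(nnet.model, newdata = test.x[,-1]), obs = test.y)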