Project Description

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

Approach

For this second project, we used a number of models available in the caret library. Because we are interested in experimenting with different values to tune the many models built on this dataset, and because the data set is rather small, a simple 10-fold cross-validation is a safe resampling choice. We will choose the best model based on the resulting performance metrics.

Importing The Data and Loading In the Libraries

Training <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentData.xlsx")
Training <- Training |>
  mutate (`Brand Code` = factor(`Brand Code`))

#Replacing the spaces in the column name with '.' to prevent issues with the mice package.
colnames(Training) = gsub(" ",".",colnames(Training))

head(Training) |>
  datatable()

Data Exploration/Cleaning

Exploration

dfSummary(Training,
          style = "grid",
          plain.ascii = FALSE,
          headings=FALSE,
          tmp.img.dir = "/tmp")
## temporary images written to 'C:\tmp'
| No | Variable | Stats / Values | Valid | Missing |
|----|----------|----------------|-------|---------|
| 1 | Brand.Code [factor] | A: 293 (12.0%), B: 1239 (50.6%), C: 304 (12.4%), D: 615 (25.1%) | 2451 (95.3%) | 120 (4.7%) |
| 2 | Carb.Volume [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5 < 5.3 < 5.7; IQR (CV): 0.2 (0); 101 distinct values | 2561 (99.6%) | 10 (0.4%) |
| 3 | Fill.Ounces [numeric] | Mean (sd): 24 (0.1); min < med < max: 23.6 < 24 < 24.3; IQR (CV): 0.1 (0); 92 distinct values | 2533 (98.5%) | 38 (1.5%) |
| 4 | PC.Volume [numeric] | Mean (sd): 0.3 (0.1); min < med < max: 0.1 < 0.3 < 0.5; IQR (CV): 0.1 (0.2); 454 distinct values | 2532 (98.5%) | 39 (1.5%) |
| 5 | Carb.Pressure [numeric] | Mean (sd): 68.2 (3.5); min < med < max: 57 < 68.2 < 79.4; IQR (CV): 5 (0.1); 106 distinct values | 2544 (98.9%) | 27 (1.1%) |
| 6 | Carb.Temp [numeric] | Mean (sd): 141.1 (4); min < med < max: 128.6 < 140.8 < 154; IQR (CV): 5.4 (0); 123 distinct values | 2545 (99.0%) | 26 (1.0%) |
| 7 | PSC [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0.1 < 0.3; IQR (CV): 0.1 (0.6); 143 distinct values | 2538 (98.7%) | 33 (1.3%) |
| 8 | PSC.Fill [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0 < 0.2 < 0.6; IQR (CV): 0.2 (0.6); 60 distinct values | 2548 (99.1%) | 23 (0.9%) |
| 9 | PSC.CO2 [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0 < 0.2; IQR (CV): 0.1 (0.8); 23 distinct values | 2532 (98.5%) | 39 (1.5%) |
| 10 | Mnf.Flow [numeric] | Mean (sd): 24.6 (119.5); min < med < max: -100.2 < 65.2 < 229.4; IQR (CV): 240.8 (4.9); 487 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 11 | Carb.Pressure1 [numeric] | Mean (sd): 122.6 (4.7); min < med < max: 105.6 < 123.2 < 140.2; IQR (CV): 6.4 (0); 140 distinct values | 2539 (98.8%) | 32 (1.2%) |
| 12 | Fill.Pressure [numeric] | Mean (sd): 47.9 (3.2); min < med < max: 34.6 < 46.4 < 60.4; IQR (CV): 4 (0.1); 108 distinct values | 2549 (99.1%) | 22 (0.9%) |
| 13 | Hyd.Pressure1 [numeric] | Mean (sd): 12.4 (12.4); min < med < max: -0.8 < 11.4 < 58; IQR (CV): 20.2 (1); 245 distinct values | 2560 (99.6%) | 11 (0.4%) |
| 14 | Hyd.Pressure2 [numeric] | Mean (sd): 21 (16.4); min < med < max: 0 < 28.6 < 59.4; IQR (CV): 34.6 (0.8); 207 distinct values | 2556 (99.4%) | 15 (0.6%) |
| 15 | Hyd.Pressure3 [numeric] | Mean (sd): 20.5 (16); min < med < max: -1.2 < 27.6 < 50; IQR (CV): 33.4 (0.8); 192 distinct values | 2556 (99.4%) | 15 (0.6%) |
| 16 | Hyd.Pressure4 [numeric] | Mean (sd): 96.3 (13.1); min < med < max: 52 < 96 < 142; IQR (CV): 16 (0.1); 40 distinct values | 2541 (98.8%) | 30 (1.2%) |
| 17 | Filler.Level [numeric] | Mean (sd): 109.3 (15.7); min < med < max: 55.8 < 118.4 < 161.2; IQR (CV): 21.7 (0.1); 288 distinct values | 2551 (99.2%) | 20 (0.8%) |
| 18 | Filler.Speed [numeric] | Mean (sd): 3687.2 (770.8); min < med < max: 998 < 3982 < 4030; IQR (CV): 110 (0.2); 244 distinct values | 2514 (97.8%) | 57 (2.2%) |
| 19 | Temperature [numeric] | Mean (sd): 66 (1.4); min < med < max: 63.6 < 65.6 < 76.2; IQR (CV): 1.2 (0); 56 distinct values | 2557 (99.5%) | 14 (0.5%) |
| 20 | Usage.cont [numeric] | Mean (sd): 21 (3); min < med < max: 12.1 < 21.8 < 25.9; IQR (CV): 5.4 (0.1); 481 distinct values | 2566 (99.8%) | 5 (0.2%) |
| 21 | Carb.Flow [numeric] | Mean (sd): 2468.4 (1073.7); min < med < max: 26 < 3028 < 5104; IQR (CV): 2042 (0.4); 533 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 22 | Density [numeric] | Mean (sd): 1.2 (0.4); min < med < max: 0.2 < 1 < 1.9; IQR (CV): 0.7 (0.3); 78 distinct values | 2570 (100.0%) | 1 (0.0%) |
| 23 | MFR [numeric] | Mean (sd): 704 (73.9); min < med < max: 31.4 < 724 < 868.6; IQR (CV): 24.7 (0.1); 587 distinct values | 2359 (91.8%) | 212 (8.2%) |
| 24 | Balling [numeric] | Mean (sd): 2.2 (0.9); min < med < max: -0.2 < 1.6 < 4; IQR (CV): 1.8 (0.4); 217 distinct values | 2570 (100.0%) | 1 (0.0%) |
| 25 | Pressure.Vacuum [numeric] | Mean (sd): -5.2 (0.6); min < med < max: -6.6 < -5.4 < -3.6; IQR (CV): 0.6 (-0.1); 16 distinct values | 2571 (100.0%) | 0 (0.0%) |
| 26 | PH [numeric] | Mean (sd): 8.5 (0.2); min < med < max: 7.9 < 8.5 < 9.4; IQR (CV): 0.2 (0); 52 distinct values | 2567 (99.8%) | 4 (0.2%) |
| 27 | Oxygen.Filler [numeric] | Mean (sd): 0 (0); min < med < max: 0 < 0 < 0.4; IQR (CV): 0 (1); 338 distinct values | 2559 (99.5%) | 12 (0.5%) |
| 28 | Bowl.Setpoint [numeric] | Mean (sd): 109.3 (15.3); min < med < max: 70 < 120 < 140; IQR (CV): 20 (0.1); 11 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 29 | Pressure.Setpoint [numeric] | Mean (sd): 47.6 (2); min < med < max: 44 < 46 < 52; IQR (CV): 4 (0); 8 distinct values | 2559 (99.5%) | 12 (0.5%) |
| 30 | Air.Pressurer [numeric] | Mean (sd): 142.8 (1.2); min < med < max: 140.8 < 142.6 < 148.2; IQR (CV): 0.8 (0); 32 distinct values | 2571 (100.0%) | 0 (0.0%) |
| 31 | Alch.Rel [numeric] | Mean (sd): 6.9 (0.5); min < med < max: 5.3 < 6.6 < 8.6; IQR (CV): 0.7 (0.1); 53 distinct values | 2562 (99.6%) | 9 (0.4%) |
| 32 | Carb.Rel [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5 < 5.4 < 6.1; IQR (CV): 0.2 (0); 42 distinct values | 2561 (99.6%) | 10 (0.4%) |
| 33 | Balling.Lvl [numeric] | Mean (sd): 2.1 (0.9); min < med < max: 0 < 1.5 < 3.7; IQR (CV): 1.8 (0.4); 82 distinct values | 2570 (100.0%) | 1 (0.0%) |

Some of the columns look skewed, others are sparse, and a few look roughly normal with slight skew. Hyd.Pressure1, Hyd.Pressure2, and Hyd.Pressure3 have a large number of zeros in their distributions, which is worrisome since those columns are sparse. Most observations carry Brand Code B, followed by D, while A and C appear in roughly equal proportion.
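
To put a number on the visual impression of skew, a quick check with the e1071 package (a small sketch, assuming e1071 is installed) lists the most heavily skewed predictors first:

library(e1071)  # provides skewness()
## Compute sample skewness for every numeric column, ignoring NAs,
## and sort so the most right-skewed predictors appear first.
Training %>%
  keep(is.numeric) %>%
  sapply(skewness, na.rm = TRUE) %>%
  sort(decreasing = TRUE)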

Training %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplot of Numerical Variables")

Imputation

There are missing values and some outliers that will be handled through imputation using the mice package. We will also remove any predictors with near-zero variance using the nearZeroVar function from the caret library.

set.seed(1234)
Training <- Training |>
  mice(m=1,print=F) |>
  complete()

Training <- Training[, -nearZeroVar(Training)]

Correlation Matrix

Let’s take a look at the correlation matrix and see how the predictors relate to one another and to PH.

## cor() only works when there are no missing values and all predictors are numeric, so keep just the numeric columns.
Training2 <- Training %>%
  keep(is.numeric)
corrplot(cor(Training2),method = "circle",type="lower")

Looking at the correlation plot, we can see some predictors are strongly correlated with one another; for example, Carb.Volume is correlated with Density, Balling, Alch.Rel, and others. Interestingly, PH (our response variable) shows no strong correlation with any single predictor, only weak, mostly negative correlations.
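
To make this concrete, the strongest pairwise correlations can be pulled straight out of the matrix (a small sketch; the 0.8 cutoff is just an illustrative choice):

## Flatten the correlation matrix and list predictor pairs
## whose absolute correlation exceeds 0.8, strongest first.
corMat <- cor(Training2)
corMat[upper.tri(corMat, diag = TRUE)] <- NA   # keep each pair only once
as.data.frame(as.table(corMat)) %>%
  filter(!is.na(Freq), abs(Freq) > 0.8) %>%
  arrange(desc(abs(Freq)))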


Pre-Process Data

Since the Brand Code column is essentially categorical, we one-hot encode it. This preserves as much of the variable’s information as possible, and most ML algorithms require numeric columns.

# Create the dummy variables using predict; the response (PH) is dropped by dummyVars, so we add it back below.
library(caret)
dummy_mod <- dummyVars(PH~.,data = Training)
Training3Dat_Mat <- predict(dummy_mod,newdata = Training)

## Convert it to a data-frame and this is the final Training set. 
TrainingThree <- data.frame(Training3Dat_Mat)

# Need to preserve this column so I can create testing and training splits.
TrainingThree <- cbind(TrainingThree,Training$PH)
## Rename the Training3Ph column name to just PH 

TrainingThree <- TrainingThree %>%
  rename(PH = `Training$PH`)

Machine-Learning Algorithms

Here we will create, train, and tune a number of candidate models on the data.

PLS model

library(caret)
## Let's fit a PLS model..
set.seed(15)
# We use a simple 10-fold cross-validation for resampling, and we create the training and testing splits here.
ctrl <- trainControl(method = "cv",number = 10)
train_indicies <- createDataPartition(TrainingThree$PH,p = 0.8,list = FALSE)
df_train <- TrainingThree[train_indicies,]
df_test <- TrainingThree[-train_indicies,]
set.seed(100)
plsTune <- train(PH~.,data = df_train,method = "pls",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
plsTune$results |> datatable()
ggplot(plsTune)

The cross-validated RMSE seems to have converged over the last few components.
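
To confirm which component count caret actually selected (the value with the lowest cross-validated RMSE), we can inspect the tuning object:

## Number of components caret retained for the final model
plsTune$bestTune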

summary(plsTune)
## Data:    X dimension: 2058 34 
##  Y dimension: 2058 1
## Fit method: oscorespls
## Number of components considered: 28
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           15.71     27.6    41.08    48.00    52.53    55.35    58.30
## .outcome    25.74     33.6    37.19    39.12    40.69    41.67    42.35
##           8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X           61.06    63.46     66.32     68.97     71.23     73.39     76.23
## .outcome    42.81    43.15     43.29     43.37     43.43     43.47     43.50
##           15 comps  16 comps  17 comps  18 comps  19 comps  20 comps  21 comps
## X            77.72     79.24     80.84     82.02     84.47     86.71     88.56
## .outcome     43.57     43.60     43.63     43.68     43.70     43.71     43.71
##           22 comps  23 comps  24 comps  25 comps  26 comps  27 comps  28 comps
## X            90.25     91.69     92.72     94.44     95.55     97.38     98.42
## .outcome     43.72     43.72     43.72     43.72     43.72     43.72     43.72
plsPred <- predict(plsTune,df_test)
postResample(plsPred,df_test$PH)
##      RMSE  Rsquared       MAE 
## 0.1390888 0.3787972 0.1072777

The PLS model performs reasonably on the test data: the RMSE is fairly low (about 0.139), though the R-squared of roughly 0.38 leaves much of the variance in PH unexplained.


PCR

Let’s attempt to perform a Principal Component Regression Model with the Caret Package.

set.seed(100)
pcrTune <- train(PH~.,data = df_train,method = "pcr",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
pcrTune$results |>datatable()
ggplot(pcrTune)

It seems the lowest RMSE was at the 27th component.

pcrPred <- predict(pcrTune,df_test)
postResample(pcrPred,df_test$PH)
##      RMSE  Rsquared       MAE 
## 0.1389831 0.3794477 0.1075790

The PCR results are nearly indistinguishable from the PLS results: PCR edges out PLS slightly on RMSE and R-squared, while PLS has the marginally lower MAE.


Next, let’s incorporate neural networks and MARS.


Neural Networks

Neural networks have a tendency to overfit the relationships between the predictors and the response because of the large number of coefficients they estimate; this can be mitigated with early stopping and with weight decay.

The decay parameter (lambda) should fall between 0 and 0.1, and the predictors should be centered and scaled.

We also perform model averaging, since the textbook suggests it tends to provide better results.

## Find highly correlated columns and remove them from the data 
tooCorrelated <- findCorrelation(cor(df_train),cutoff = .75)
trainXnet <- df_train[,-tooCorrelated]
testXnet <- df_test[,-tooCorrelated]

Following the textbook, I removed highly correlated predictors from the training set before feeding it into the neural network.

## Let's create a bunch of candidate set of models to evaluate
nnetGrid <- expand.grid(.decay = c(0,0.01,0.1),.size = c(1:5),.bag = FALSE)
## Train the neural network and tune its settings. The grid is kept small since avNNet takes a while to fit.
set.seed(100)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
nnetModel <- train(PH~.,data = trainXnet,method = "avNNet",tuneGrid = nnetGrid,trControl = ctrl,preProc = c("center","scale"),linout = TRUE,trace = FALSE,MaxNWts = 5 * (ncol(trainXnet)+ 1) + 5 + 1,maxit = 250)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetModel$results |> datatable()
ggplot(nnetModel)

Looking at the plot, the two nonzero weight-decay curves sit very close together, with a decay of 0.01 winning at five hidden units.

nnetPred <- predict(nnetModel,testXnet)
postResample(nnetPred,testXnet$PH)
##       RMSE   Rsquared        MAE 
## 0.12351504 0.51176183 0.09473186

With the current tuning, the neural network performs well and beats both the PLS and PCR models.

MARS Model

Let’s create a MARS model now. There are two tuning parameters: the degree of the features added to the model and the number of retained terms (nprune); the latter can also be determined by the default GCV pruning procedure (see the model summary on page 149 of the textbook).

# Use the same 10-fold cross-validation and try nprune values of 2:30 to see how many terms get retained.
set.seed(100)
Marsgrid <- expand.grid(degree = 1:3,nprune = 2:30)
Marsmodel <- train(PH~.,data = df_train,method = "earth",tuneGrid = Marsgrid,trControl = ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
ggplot(Marsmodel)

It seems that higher degrees combined with larger nprune values capture most of the variance and produce the lowest RMSE values.

Marspred <- predict(Marsmodel,df_test)
postResample(Marspred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.12329640 0.51203951 0.09219793

Interestingly, the MARS model performs essentially on par with the neural network, in fact marginally better on RMSE, R-squared, and MAE.


Finally, let’s create random forest and boosting models.


Random Forest

Let’s try to create a random forest model; the main parameter tuned here is mtry, the number of predictors randomly sampled as split candidates at each node.

## Breiman suggests setting mtry near one third of the total number of predictors; we start with five values of mtry and use a small number of trees (pg 200).
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(100)
rfGrid <- expand.grid(mtry = c(2,4,6,8,10))
RandomF <- train(PH~.,data = df_train,trControl = ctrl,tuneGrid = rfGrid,method = "rf")
RandomF$results |> datatable()
ggplot(RandomF)

It seems that using between eight and ten predictors per split yields a slight reduction in RMSE.

ggplot(varImp(RandomF,scale = TRUE))

It seems Mnf.Flow, Usage.cont, and Brand Code C were important predictors in constructing the random forest model.

RFpred <- predict(RandomF,df_test)
postResample(RFpred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.10202434 0.69200305 0.07291014

Not surprisingly, the random forest performs very well on the test data; I would expect boosted trees and xgboost to perform even better on this type of data.

Boosted Trees

Let’s try to create boosted trees now.

## Start with a modest range of interaction depths, tree counts from 100 to 1000 in steps of 100, and the two shrinkage values noted in the textbook. n.minobsinnode is the minimum number of observations in a terminal node.
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 100),
.shrinkage = c(0.01, 0.1),.n.minobsinnode = 10)
set.seed(100)
gbmModel <- train(PH~.,data = df_train,tuneGrid = gbmGrid,method = "gbm",verbose = FALSE)
gbmModel$results |> datatable()
ggplot(gbmModel)

Higher shrinkage (0.1) seems to yield lower RMSE values, and the maximum tree depth tried here (seven) gives the lowest; bear in mind, however, that deeper trees can overfit the training data.

summary(gbmModel,plot = FALSE)|> datatable()

It seems Usage.cont and Mnf.Flow had a strong influence in forming the boosted trees, followed by Brand Code C and Oxygen.Filler.

GBMpred <- predict(gbmModel,df_test)
postResample(GBMpred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.11067048 0.60673144 0.08022185

The boosting model performed slightly worse than the random forest, but it still did well on the testing data.

XGBoost

Finally, I want to try out a popular machine learning algorithm, xgboost; given the nature of this algorithm, I expected this technique to work out best.

## Source for tuning the xgboost model: https://www.youtube.com/watch?v=lwv8GneTUOk&ab_channel=DataGarden
## xgboost needs its own tuning grid; a gamma of 5 is suggested when in doubt (see the source above for tuning guidance).
xgBgrid <- expand.grid(nrounds = 100,eta = c(0.1,0.3),max_depth = c(2,4,6),colsample_bytree = c(0.5,0.9),subsample = c(0.5,0.8),gamma = 5,min_child_weight = c(0,10))
xgBoostModel <- train(PH~.,data = df_train,method = "xgbTree",objective = "reg:squarederror",trControl = ctrl,tuneGrid = xgBgrid)
ggplot(varImp(xgBoostModel))

xgbPred <- predict(xgBoostModel,df_test)
postResample(xgbPred,df_test$PH)
##      RMSE  Rsquared       MAE 
## 0.1622689 0.2303943 0.1275589

I am not sure what happened here: xgboost did not do well on this data set. I am still trying to figure out why, but for now I will keep this failed model in for the record.

Create The Table and Compare Results

Table <- bind_rows(pls = postResample(plsPred,df_test$PH),pcr = postResample(pcrPred,df_test$PH),NN = postResample(nnetPred,df_test$PH),Mars = postResample(Marspred,df_test$PH),RF = postResample(RFpred,df_test$PH),Boost = postResample(GBMpred,df_test$PH),xgb = postResample(xgbPred,df_test$PH))
Table$id = c("PLS","PCR","Neural Network","MARS",
             "Random Forest","Boosted Trees","xgBoost")
Table |> datatable()

Comparing all the metrics, Random Forest and Boosted Trees performed best on the test data. Unfortunately, xgBoost did not work well in this case, even though it is popular among many data scientists.
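
The same conclusion can be read directly off the table by sorting the models on test-set RMSE (lower is better):

## Rank the candidate models from best to worst test-set RMSE.
Table %>%
  arrange(RMSE) %>%
  select(id, RMSE, Rsquared, MAE)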


Test Set

I will now apply my models to the evaluation set from the other Excel file and export the predictions to an Excel file.

But first, we will have to read and clean the data

Test <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentEvaluation.xlsx")
Test <- Test |>
  mutate (`Brand Code` = factor(`Brand Code`))

#Replacing the spaces in the column name with '.' to prevent issues with the mice package.
colnames(Test) = gsub(" ",".",colnames(Test))

Test |> datatable()

Now I have to perform the same data-cleaning and preprocessing procedures.

dfSummary(Test,
          style = "grid",
          plain.ascii = FALSE,
          headings=FALSE,
          tmp.img.dir = "/tmp")
## temporary images written to 'C:\tmp'
| No | Variable | Stats / Values | Valid | Missing |
|----|----------|----------------|-------|---------|
| 1 | Brand.Code [factor] | A: 35 (13.5%), B: 129 (49.8%), C: 31 (12.0%), D: 64 (24.7%) | 259 (97.0%) | 8 (3.0%) |
| 2 | Carb.Volume [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5.1 < 5.3 < 5.7; IQR (CV): 0.2 (0); 71 distinct values | 266 (99.6%) | 1 (0.4%) |
| 3 | Fill.Ounces [numeric] | Mean (sd): 24 (0.1); min < med < max: 23.7 < 24 < 24.2; IQR (CV): 0.1 (0); 52 distinct values | 261 (97.8%) | 6 (2.2%) |
| 4 | PC.Volume [numeric] | Mean (sd): 0.3 (0.1); min < med < max: 0.1 < 0.3 < 0.5; IQR (CV): 0.1 (0.2); 183 distinct values | 263 (98.5%) | 4 (1.5%) |
| 5 | Carb.Pressure [numeric] | Mean (sd): 68.3 (3.9); min < med < max: 60.2 < 68 < 77.6; IQR (CV): 5.3 (0.1); 76 distinct values | 267 (100.0%) | 0 (0.0%) |
| 6 | Carb.Temp [numeric] | Mean (sd): 141.2 (4.3); min < med < max: 130 < 140.8 < 154; IQR (CV): 5.4 (0); 88 distinct values | 266 (99.6%) | 1 (0.4%) |
| 7 | PSC [numeric] | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0.1 < 0.2; IQR (CV): 0.1 (0.6); 98 distinct values | 262 (98.1%) | 5 (1.9%) |
| 8 | PSC.Fill [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0 < 0.2 < 0.6; IQR (CV): 0.2 (0.6); 48 distinct values | 264 (98.9%) | 3 (1.1%) |
| 9 | PSC.CO2 [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0 < 0.2; IQR (CV): 0 (0.8); 18 distinct values | 262 (98.1%) | 5 (1.9%) |
| 10 | Mnf.Flow [numeric] | Mean (sd): 21 (117.8); min < med < max: -100.2 < 0.2 < 220.4; IQR (CV): 241.3 (5.6); 114 distinct values | 267 (100.0%) | 0 (0.0%) |
| 11 | Carb.Pressure1 [numeric] | Mean (sd): 123 (4.4); min < med < max: 113 < 123.4 < 136; IQR (CV): 5.3 (0); 87 distinct values | 263 (98.5%) | 4 (1.5%) |
| 12 | Fill.Pressure [numeric] | Mean (sd): 48.1 (3.4); min < med < max: 37.8 < 47.8 < 60.2; IQR (CV): 4.2 (0.1); 61 distinct values | 265 (99.3%) | 2 (0.7%) |
| 13 | Hyd.Pressure1 [numeric] | Mean (sd): 12 (13.5); min < med < max: -50 < 10.4 < 50; IQR (CV): 20.4 (1.1); 115 distinct values | 267 (100.0%) | 0 (0.0%) |
| 14 | Hyd.Pressure2 [numeric] | Mean (sd): 20.1 (17.2); min < med < max: -50 < 26.8 < 61.4; IQR (CV): 34.8 (0.9); 95 distinct values | 266 (99.6%) | 1 (0.4%) |
| 15 | Hyd.Pressure3 [numeric] | Mean (sd): 19.6 (16.6); min < med < max: -50 < 27.7 < 49.2; IQR (CV): 33 (0.8); 89 distinct values | 266 (99.6%) | 1 (0.4%) |
| 16 | Hyd.Pressure4 [numeric] | Mean (sd): 97.8 (13.9); min < med < max: 68 < 98 < 140; IQR (CV): 14 (0.1); 34 distinct values | 263 (98.5%) | 4 (1.5%) |
| 17 | Filler.Level [numeric] | Mean (sd): 110.3 (15.5); min < med < max: 69.2 < 118.6 < 153.2; IQR (CV): 19.6 (0.1); 107 distinct values | 265 (99.3%) | 2 (0.7%) |
| 18 | Filler.Speed [numeric] | Mean (sd): 3581.4 (911.2); min < med < max: 1006 < 3978 < 4020; IQR (CV): 184 (0.3); 82 distinct values | 257 (96.3%) | 10 (3.7%) |
| 19 | Temperature [numeric] | Mean (sd): 66.2 (1.7); min < med < max: 63.8 < 65.8 < 75.4; IQR (CV): 1.2 (0); 36 distinct values | 265 (99.3%) | 2 (0.7%) |
| 20 | Usage.cont [numeric] | Mean (sd): 20.9 (3); min < med < max: 12.9 < 21.4 < 24.6; IQR (CV): 5.6 (0.1); 174 distinct values | 265 (99.3%) | 2 (0.7%) |
| 21 | Carb.Flow [numeric] | Mean (sd): 2408.6 (1161.4); min < med < max: 0 < 3038 < 3858; IQR (CV): 2132 (0.5); 178 distinct values | 267 (100.0%) | 0 (0.0%) |
| 22 | Density [numeric] | Mean (sd): 1.2 (0.4); min < med < max: 0.1 < 1 < 1.8; IQR (CV): 0.7 (0.3); 53 distinct values | 266 (99.6%) | 1 (0.4%) |
| 23 | MFR [numeric] | Mean (sd): 697.8 (96.4); min < med < max: 15.6 < 724.6 < 784.8; IQR (CV): 24.4 (0.1); 162 distinct values | 236 (88.4%) | 31 (11.6%) |
| 24 | Balling [numeric] | Mean (sd): 2.2 (0.9); min < med < max: 0.9 < 1.6 < 3.8; IQR (CV): 1.7 (0.4); 82 distinct values | 266 (99.6%) | 1 (0.4%) |
| 25 | Pressure.Vacuum [numeric] | Mean (sd): -5.2 (0.6); min < med < max: -6.4 < -5.2 < -3.6; IQR (CV): 0.8 (-0.1); 15 distinct values | 266 (99.6%) | 1 (0.4%) |
| 26 | PH [logical] | All NA’s | 0 (0.0%) | 267 (100.0%) |
| 27 | Oxygen.Filler [numeric] | Mean (sd): 0 (0); min < med < max: 0 < 0 < 0.4; IQR (CV): 0 (1.1); 149 distinct values | 264 (98.9%) | 3 (1.1%) |
| 28 | Bowl.Setpoint [numeric] | Mean (sd): 109.6 (15); min < med < max: 70 < 120 < 130; IQR (CV): 20 (0.1); freqs: 70: 9 (3.4%), 80: 15 (5.6%), 90: 35 (13.2%), 100: 12 (4.5%), 110: 49 (18.4%), 120: 139 (52.3%), 130: 7 (2.6%) | 266 (99.6%) | 1 (0.4%) |
| 29 | Pressure.Setpoint [numeric] | Mean (sd): 47.7 (2.1); min < med < max: 44 < 46 < 52; IQR (CV): 4 (0); 6 distinct values | 265 (99.3%) | 2 (0.7%) |
| 30 | Air.Pressurer [numeric] | Mean (sd): 142.8 (1.2); min < med < max: 141.2 < 142.6 < 147.2; IQR (CV): 0.6 (0); 24 distinct values | 266 (99.6%) | 1 (0.4%) |
| 31 | Alch.Rel [numeric] | Mean (sd): 6.9 (0.5); min < med < max: 6.4 < 6.6 < 7.8; IQR (CV): 0.6 (0.1); 31 distinct values | 264 (98.9%) | 3 (1.1%) |
| 32 | Carb.Rel [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5.2 < 5.4 < 5.7; IQR (CV): 0.2 (0); 28 distinct values | 265 (99.3%) | 2 (0.7%) |
| 33 | Balling.Lvl [numeric] | Mean (sd): 2.1 (0.9); min < med < max: 0 < 1.5 < 3.4; IQR (CV): 1.7 (0.4); 53 distinct values | 267 (100.0%) | 0 (0.0%) |

Glancing at the test data, it more or less resembles the structure of the training set, so I will clean and preprocess it in the same way.

Clean/Preprocess The Data

The evaluation data must go through the same preprocessing (dummy encoding of Brand Code) before the trained models can be applied to it.

# Create the dummy variables; the response (PH) is entirely missing in the evaluation set, and dummyVars drops it.
library(caret)
dummy_mod1 <- dummyVars(PH~.,data = Test)
Test2Dat_Mat <- predict(dummy_mod1,newdata = Test)

## Convert it to a data frame; this is the final test set. Note that na.omit drops every row containing a missing value.
TestTwo <- data.frame(Test2Dat_Mat) |>
  na.omit()

Test Model on Cleaned Test Data

Because na.omit drops every row with a missing value, the models only return predictions for complete observations; as a result, the prediction table below contains 200 rows rather than 267.

PlsTest <- predict(plsTune,TestTwo)
pcRTest <- predict(pcrTune,TestTwo)
NeuralTest <- predict(nnetModel,TestTwo)
MarsTest <- predict(Marsmodel,TestTwo)
RFTest <- predict(RandomF,TestTwo)
BoostTest <- predict(gbmModel,TestTwo)
xgBoostTest <- predict(xgBoostModel,TestTwo)
TestTable <- bind_cols("PLS" = PlsTest,"PCR" = pcRTest,"Neural Network" = NeuralTest,"MARS" = MarsTest,"Random Forest" = RFTest,"Boosted Tree" = BoostTest,"xgBoost" = xgBoostTest)

TestTable
## # A tibble: 200 × 7
##      PLS   PCR `Neural Network` MARS[,"y"] `Random Forest` `Boosted Tree`
##    <dbl> <dbl>            <dbl>      <dbl>           <dbl>          <dbl>
##  1  8.60  8.59             8.47       8.47            8.52           8.47
##  2  8.60  8.60             8.51       8.54            8.51           8.54
##  3  8.52  8.52             8.52       8.48            8.49           8.40
##  4  8.58  8.57             8.48       8.53            8.51           8.49
##  5  8.69  8.67             8.52       8.62            8.56           8.58
##  6  8.59  8.60             8.51       8.51            8.53           8.56
##  7  8.59  8.59             8.52       8.55            8.48           8.53
##  8  8.64  8.63             8.74       8.71            8.61           8.63
##  9  8.42  8.43             8.38       8.34            8.31           8.28
## 10  8.62  8.63             8.65       8.59            8.59           8.52
## # ℹ 190 more rows
## # ℹ 1 more variable: xgBoost <dbl>

Write to Excel

library(writexl)
## Warning: package 'writexl' was built under R version 4.3.3
write_xlsx(TestTable,"Finalpredictions.xlsx")

Thoughts on Process

After cleaning the data and building a variety of models, Random Forest and Boosted Trees proved to be the best options for the evaluation set, since they performed very well at predicting PH. From a business perspective, the variable importance scores show that Mnf.Flow, Usage.cont, and Brand Code C were important predictors of PH in the random forest model, and the same predictors surfaced in the boosted trees. Without deeper domain knowledge, I can only speculate that PH is influenced by how Brand C is measured or produced, and by the roles these other key process variables play. Either way, I would tell stakeholders that we should consider carefully how we handle Mnf.Flow and Usage.cont, and learn more about how Brand Code C's PH values arise.

I am disappointed with how the xgboost model turned out; I was hoping it would outperform the random forest and boosted trees, but I kept it in to show that not everything goes right in this sort of machine-learning workflow.

When applying the models to the evaluation set, I noticed that any observation with an NA in any column is skipped at prediction time, which is why the Excel file contains 200 predictions instead of the full 267. Ultimately, I would choose the Random Forest model because it performed best on the held-out data, with the strongest R-squared and RMSE metrics.
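
One refinement for a future iteration: rather than dropping incomplete rows with na.omit, the evaluation set could be imputed with mice just as the training data was, so that all 267 observations receive a prediction. A sketch under that assumption (untested here; PH is excluded from imputation since it is entirely missing):

set.seed(1234)
## Impute the evaluation predictors instead of dropping incomplete rows.
## PH is removed first because a column that is entirely NA cannot be imputed.
TestImputed <- Test %>%
  select(-PH) %>%
  mice(m = 1, print = FALSE) %>%
  complete()
TestImputed$PH <- Test$PH   # re-attach the empty response so dummy_mod1's formula still applies
## Rebuild the dummy-coded matrix and score the chosen model on all 267 rows.
TestAll <- data.frame(predict(dummy_mod1, newdata = TestImputed))
RFTestAll <- predict(RandomF, newdata = TestAll)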