For this second project I am going to create a bunch of models using the Caret Library, since I am interested in playing different values to tune the many kinds of models I want to make with this dataset, for the resampling method I stuck with a simple 10-fold cross validation since the dataset is rather small and it is a rather safe option to cross-validate the data. I will train a bunch of models and see how well it performs on the dataset and from there I will simply choose the best model metrics.

Importing The Data and Loading In the Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Training <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentData.xlsx")
head(Training)
## # A tibble: 6 × 33
##   `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##   <chr>                <dbl>         <dbl>       <dbl>           <dbl>
## 1 B                     5.34          24.0       0.263            68.2
## 2 A                     5.43          24.0       0.239            68.4
## 3 B                     5.29          24.1       0.263            70.8
## 4 A                     5.44          24.0       0.293            63  
## 5 A                     5.49          24.3       0.111            67.2
## 6 A                     5.38          23.9       0.269            66.6
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(Training)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
##  $ Brand Code       : chr [1:2571] "B" "A" "B" "A" ...
##  $ Carb Volume      : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill Ounces      : num [1:2571] 24 24 24.1 24 24.3 ...
##  $ PC Volume        : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb Pressure    : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb Temp        : num [1:2571] 141 140 145 133 137 ...
##  $ PSC              : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC Fill         : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
##  $ PSC CO2          : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
##  $ Mnf Flow         : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num [1:2571] 119 122 120 115 118 ...
##  $ Fill Pressure    : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd Pressure1    : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler Level     : num [1:2571] 121 119 120 118 119 ...
##  $ Filler Speed     : num [1:2571] 4002 3986 4020 4012 4010 ...
##  $ Temperature      : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage cont       : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb Flow        : num [1:2571] 2932 3144 2914 3062 3054 ...
##  $ Density          : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num [1:2571] 725 727 735 731 723 ...
##  $ Balling          : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure Vacuum  : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen Filler    : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl Setpoint    : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air Pressurer    : num [1:2571] 143 143 142 146 146 ...
##  $ Alch Rel         : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb Rel         : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling Lvl      : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...

Glancing at the structure of the data, we notice that there are few missing values and potentially have some predictors that are of the same value??, Also is there any significance of the brand code?


Data Exploration/Cleaning

I’ve figured out I might as well remove NA values since we don’t lose that many observations within the training set, also looking at the training set the missing values are completeley at random and there is no fixed pattern of missingness.

Training1 <- Training %>%
  na.omit()
Training1 <- Training1 %>%
  select_if(is.numeric)
Training1 <- Training1 %>%
  pivot_longer(colnames(Training1)) %>%
  as.data.frame()
ggplot(Training1,aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name,scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Some of the columns looks skewed,while others are sparse and a few looks normally distributed but slightly skewed.. it appears Hyd_Press 1,2, and 3 appears to have a lot of zeros in their distribution, this is kinda worrisome those columns seems to have sparse data.

Training %>%
  filter(!is.na(`Brand Code`)) %>%
ggplot(aes(x = `Brand Code`)) +
  geom_bar() + labs(title = "Proportion of Brand Code Within The Training Set")

It appears most of the Brand Code are B followed by D, A and C seems to be of equal proportion.

Correlation Matrix

Let’s take a look at the correlation matrix and see what works.

## Correlation_Matrix only works if there are no empty data and the predictors are numeric.
Training2 <- Training %>%
  select_if(is.numeric) %>%
  na.omit()
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Training2),method = "circle")

Looking at the correlation plot, we can see some predictors are strongly correlated with other predictors i.e we see that carb volume is correlated with density, balling ,AlCh Rel and etc, Interestingly the ph(our response variable doesn’t seem to have any strong correlation with the predictors just a negative correlation..)


Pre-Process Data

Since the Brand Code column is essentially a character column I figure I may as well create one-hot encoding for this column since it may be wise to preserve as much of the variables as I can. and most ml algorithms require columns to be numeric.

# Training3 is going to be the final cleaned up version 
Training3 <- Training %>%
  na.omit()

Should preprocess the data into training/testing before I actually applied it to the test set.

# Create the dummy variables using predict and the Y variable (PH) will not be available in the trainingset
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
dummy_mod <- dummyVars(PH~.,data = Training3)
Training3Dat_Mat <- predict(dummy_mod,newdata = Training3)

## Convert it to a data-frame and this is the final Training set. 
TrainingThree <- data.frame(Training3Dat_Mat)

# Need to preserve this column so I can create testing and training splits.
TrainingThree <- cbind(TrainingThree,Training3$PH)
## Rename the Training3Ph column name to just PH 

TrainingThree <- TrainingThree %>%
  rename(PH = `Training3$PH`)

Machine-Learning Algorithms

Here we will create,train and tune a bunch of models with the data I have cleaned.

PLS model

library(caret)
## Let's fit a PLS model..
set.seed(15)
# We are going to use a simple 10-fold cross validation for our resampling purposes. and make training and testing set here..
ctrl <- trainControl(method = "cv",number = 10)
train_indicies <- createDataPartition(TrainingThree$PH,p = 0.8,list = FALSE)
df_train <- TrainingThree[train_indicies,]
df_test <- TrainingThree[-train_indicies,]
set.seed(100)
plsTune <- train(PH~.,data = df_train,method = "pls",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
plsTune
## Partial Least Squares 
## 
## 1631 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1468900  0.2676190  0.1163011
##    2     0.1377369  0.3572221  0.1079962
##    3     0.1348145  0.3841050  0.1069736
##    4     0.1328561  0.4027399  0.1044903
##    5     0.1321517  0.4083418  0.1042786
##    6     0.1314303  0.4146814  0.1030792
##    7     0.1309716  0.4187563  0.1028515
##    8     0.1307266  0.4206853  0.1024522
##    9     0.1306061  0.4214061  0.1022479
##   10     0.1306310  0.4213306  0.1021473
##   11     0.1305677  0.4220748  0.1020323
##   12     0.1305373  0.4224209  0.1019663
##   13     0.1305242  0.4225067  0.1019982
##   14     0.1305053  0.4226692  0.1019040
##   15     0.1305952  0.4218776  0.1019349
##   16     0.1305276  0.4225688  0.1018329
##   17     0.1304762  0.4230303  0.1018234
##   18     0.1303858  0.4240251  0.1016606
##   19     0.1303051  0.4245581  0.1016329
##   20     0.1302389  0.4250852  0.1014709
##   21     0.1301316  0.4259696  0.1012358
##   22     0.1300326  0.4268484  0.1011696
##   23     0.1299492  0.4276008  0.1011171
##   24     0.1298814  0.4282063  0.1009655
##   25     0.1297590  0.4292651  0.1009255
##   26     0.1297082  0.4295870  0.1009541
##   27     0.1296893  0.4297161  0.1009171
##   28     0.1297087  0.4295524  0.1009449
##   29     0.1297126  0.4295741  0.1009660
##   30     0.1297195  0.4295478  0.1009338
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.
ggplot(plsTune)

The # of components seems to have converge around the last few components.

summary(plsTune)
## Data:    X dimension: 1631 35 
##  Y dimension: 1631 1
## Fit method: oscorespls
## Number of components considered: 27
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           17.03    28.07    43.61    46.97    51.20    53.93    56.90
## .outcome    27.08    36.39    39.24    41.47    42.26    42.98    43.38
##           8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X           59.52    62.88     66.65     70.04     72.60     75.34     77.75
## .outcome    43.62    43.74     43.81     43.87     43.93     43.97     44.01
##           15 comps  16 comps  17 comps  18 comps  19 comps  20 comps  21 comps
## X            79.57     81.45     83.69     85.19     86.68     88.69     90.06
## .outcome     44.12     44.27     44.36     44.44     44.49     44.53     44.62
##           22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## X            91.23     92.49     93.74     95.10     96.15     96.74
## .outcome     44.71     44.74     44.76     44.79     44.82     44.84
plsPred <- predict(plsTune,df_test)
postResample(plsPred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.12603288 0.43544849 0.09541264

The PLS model seems to perform well on the test data set and the R-squared is somewhat high and the RMSE is somewhat low.


PCR

Let’s attempt to perform a Principal Component Regression Model with the Caret Package.

set.seed(100)
pcrTune <- train(PH~.,data = df_train,method = "pcr",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
pcrTune
## Principal Component Analysis 
## 
## 1631 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     0.1695959  0.02526793  0.1357616
##    2     0.1557945  0.17618815  0.1233237
##    3     0.1556513  0.17715416  0.1233658
##    4     0.1543764  0.19027955  0.1227607
##    5     0.1506671  0.22903567  0.1194065
##    6     0.1450333  0.28605439  0.1154998
##    7     0.1415528  0.32023102  0.1129494
##    8     0.1411020  0.32463825  0.1125569
##    9     0.1406065  0.33003830  0.1120909
##   10     0.1403551  0.33245317  0.1119522
##   11     0.1401127  0.33495508  0.1115769
##   12     0.1394612  0.34050381  0.1109875
##   13     0.1393994  0.34104795  0.1108544
##   14     0.1394052  0.34122174  0.1108851
##   15     0.1388640  0.34615077  0.1106853
##   16     0.1377597  0.35596527  0.1090866
##   17     0.1358703  0.37427317  0.1074072
##   18     0.1354931  0.37779555  0.1069131
##   19     0.1354720  0.37833940  0.1069085
##   20     0.1340224  0.39234136  0.1059181
##   21     0.1340602  0.39183191  0.1059894
##   22     0.1341623  0.39091084  0.1060099
##   23     0.1338306  0.39368347  0.1059302
##   24     0.1333390  0.39841794  0.1054552
##   25     0.1323154  0.40686339  0.1044094
##   26     0.1323778  0.40635365  0.1044544
##   27     0.1305801  0.42197930  0.1021193
##   28     0.1305891  0.42186560  0.1021181
##   29     0.1306906  0.42096645  0.1022790
##   30     0.1308241  0.41985419  0.1022839
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.
ggplot(pcrTune)

It seems the lowest RMSE was at the 27th component.

pcrPred <- predict(pcrTune,df_test)
postResample(pcrPred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.12769757 0.42112115 0.09685311

The PCR seems no different than from the pls regression, the pcr performed slightly worse than the pls though.


Let’s try to incorporate neural-networks and MARS in this section


Neural Networks

Neural Networks have a tendency to overfit the relationships between the predictors and the response due to the large numbers of regression coefficients you can prevent with this with early stopping and to use weight decay

lambda should be between 0 and 0.1 and should be scaled and centered

Try to perform model-averaging, since it seems to provide better results according to the textbook.

## Find highly correlated columns and remove them from the data 
tooCorrelated <- findCorrelation(cor(df_train),cutoff = .75)
trainXnet <- df_train[,-tooCorrelated]
testXnet <- df_test[,-tooCorrelated]

I mostly followed the textbook and tried to remove highly correlated predictors from my trainingset before I fed it into the neural network.

## Let's create a bunch of candidate set of models to evaluate
nnetGrid <- expand.grid(.decay = c(0,0.01,0.1),.size = c(1:5),.bag = FALSE)
## Let's train a Neural Network Model and tweak the settings. I halved the parameters since it takes the neural network a while to make
set.seed(100)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
nnetModel <- train(PH~.,data = trainXnet,method = "avNNet",tuneGrid = nnetGrid,trControl = ctrl,preProc = c("center","scale"),linout = TRUE,trace = FALSE,MaxNWts = 5 * (ncol(trainXnet)+ 1) + 5 + 1,maxit = 250)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetModel
## Model Averaged Neural Network 
## 
## 1631 samples
##   23 predictor
## 
## Pre-processing: centered (23), scaled (23) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE       
##   0.00   1     0.1364979  0.3684368  0.10803184
##   0.00   2     0.1320085  0.4203298  0.10361097
##   0.00   3     0.1675585  0.3552124  0.10658585
##   0.00   4     0.1538716  0.4045627  0.09956678
##   0.00   5     0.2402179  0.4938463  0.10758527
##   0.01   1     0.1365235  0.3680112  0.10820210
##   0.01   2     0.1281816  0.4424541  0.10157958
##   0.01   3     0.1222415  0.4935821  0.09608904
##   0.01   4     0.1200936  0.5113575  0.09429327
##   0.01   5     0.1155865  0.5459104  0.08993168
##   0.10   1     0.1374366  0.3605308  0.10911843
##   0.10   2     0.1270508  0.4532412  0.09985821
##   0.10   3     0.1224608  0.4918372  0.09620079
##   0.10   4     0.1195943  0.5157582  0.09308668
##   0.10   5     0.1167379  0.5380169  0.09006987
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag = FALSE.
ggplot(nnetModel)

Looking at the plot it seems the two weight decay are very close together with 0.01 winning at the fifth hidden unit.

nnetPred <- predict(nnetModel,testXnet)
postResample(nnetPred,testXnet$PH)
##       RMSE   Rsquared        MAE 
## 0.11356806 0.54208364 0.08800004

The neural network seems to perform well with the current tweaking of the parameters and performed better than the pls and pcr model.

MARS Model

Let’s create a MARS model now.. There are two tuning parameters, the degree of the features that are added to the model and the # of retained terms, the latter can be determined by the default pruning procedure (GCV), model summary on page 149 (gotta reread this section again..) and nprune is the # of retained terms. (As noted by the textbook.)

# Stick with the default resampling and use 2:30 to see how much terms get retained.
set.seed(100)
Marsgrid <- expand.grid(degree = 1:3,nprune = 2:30)
Marsmodel <- train(PH~.,data = df_train,method = "earth",tuneGrid = Marsgrid,trControl = ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
ggplot(Marsmodel)

It seems like high degrees and large prune values seems to capture most of the variance and produce really low RMSE values.

Marspred <- predict(Marsmodel,df_test)
postResample(Marspred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.13270590 0.42236421 0.09069764

Seems like the Mars Model performed worse than the neural network, I am wondering why?, was it because of the one hot encoding used on the brand_code?


Finally, let’s create a random forest/ boosting model


Random Forest

Let’s try and create a randomforest model, two parameters that can be tuned are the complexity parameter and the maximum node depth

## Breiman suggest tuning the mtry to 1/3 of the total predictors, start with 5 random values of k, and we will use a small # of trees.. (pg 200)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(100)
rfGrid <- expand.grid(mtry = c(2,4,6,8,10))
RandomF <- train(PH~.,data = df_train,trControl = ctrl,tuneGrid = rfGrid,method = "rf")
RandomF
## Random Forest 
## 
## 1631 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE        Rsquared   MAE       
##    2    0.10740741  0.6551531  0.08186186
##    4    0.10073141  0.6874531  0.07543571
##    6    0.09862468  0.6943415  0.07337440
##    8    0.09739290  0.6984954  0.07201076
##   10    0.09687985  0.6983824  0.07157267
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
ggplot(RandomF)

Seems like it using between eight to ten predictors per split seems to make slight reduction on the RMSE values..

ggplot(varImp(RandomF,scale = TRUE))

Seems like Mnf Flow and Usage Cont and Brand Code C were important predictors in constructing the random forest model.

RFpred <- predict(RandomF,df_test)
postResample(RFpred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.08998589 0.73377471 0.06453793

Not surprisingly, the randomForest did seems to perform really well on the test-data, I am going to assume that boosted trees and xgboost will perform even better on this type of data.

Boosted Trees

Let’s try to create boosted trees now.

## I will start with a simple interaction depth, and use 100 iterations of tree between 100 to 1000 and use the two shrinkage values as noted in the textbook. n.minobsinnode is the integer sepecfiying the min number of observations in the terminal node of the tree
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 100),
.shrinkage = c(0.01, 0.1),.n.minobsinnode = 10)
set.seed(100)
gbmModel <- train(PH~.,data = df_train,tuneGrid = gbmGrid,method = "gbm",verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting 
## 
## 1631 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1631, 1631, 1631, 1631, 1631, 1631, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   0.01       1                   100     0.1526692  0.3110343  0.12136449
##   0.01       1                   200     0.1440630  0.3570009  0.11403388
##   0.01       1                   300     0.1395674  0.3787609  0.11037140
##   0.01       1                   400     0.1369548  0.3921618  0.10826390
##   0.01       1                   500     0.1352405  0.4023632  0.10680884
##   0.01       1                   600     0.1338960  0.4110923  0.10567195
##   0.01       1                   700     0.1328554  0.4179524  0.10478901
##   0.01       1                   800     0.1319617  0.4241548  0.10401176
##   0.01       1                   900     0.1312313  0.4290510  0.10334544
##   0.01       1                  1000     0.1306156  0.4334061  0.10277306
##   0.01       3                   100     0.1417338  0.4165970  0.11255890
##   0.01       3                   200     0.1318490  0.4520906  0.10427643
##   0.01       3                   300     0.1274802  0.4751170  0.10047647
##   0.01       3                   400     0.1248023  0.4916002  0.09806268
##   0.01       3                   500     0.1228341  0.5042145  0.09622447
##   0.01       3                   600     0.1213781  0.5134974  0.09482033
##   0.01       3                   700     0.1202058  0.5211239  0.09364531
##   0.01       3                   800     0.1192501  0.5272987  0.09269604
##   0.01       3                   900     0.1184998  0.5321027  0.09190949
##   0.01       3                  1000     0.1178344  0.5364405  0.09121388
##   0.01       5                   100     0.1376373  0.4588917  0.10921212
##   0.01       5                   200     0.1267420  0.4962076  0.09983185
##   0.01       5                   300     0.1219315  0.5199665  0.09549226
##   0.01       5                   400     0.1190783  0.5362044  0.09275528
##   0.01       5                   500     0.1172812  0.5464077  0.09096722
##   0.01       5                   600     0.1159004  0.5547584  0.08954558
##   0.01       5                   700     0.1148201  0.5615465  0.08842717
##   0.01       5                   800     0.1140210  0.5665256  0.08757618
##   0.01       5                   900     0.1133441  0.5707987  0.08685144
##   0.01       5                  1000     0.1127675  0.5744251  0.08627229
##   0.01       7                   100     0.1348587  0.4895175  0.10697044
##   0.01       7                   200     0.1231846  0.5262075  0.09664760
##   0.01       7                   300     0.1180758  0.5498012  0.09183438
##   0.01       7                   400     0.1152721  0.5643821  0.08907335
##   0.01       7                   500     0.1134272  0.5749298  0.08723523
##   0.01       7                   600     0.1121518  0.5824252  0.08595086
##   0.01       7                   700     0.1112109  0.5880469  0.08496070
##   0.01       7                   800     0.1104661  0.5925434  0.08422219
##   0.01       7                   900     0.1098496  0.5963808  0.08359064
##   0.01       7                  1000     0.1093491  0.5994872  0.08310536
##   0.10       1                   100     0.1305099  0.4333978  0.10257809
##   0.10       1                   200     0.1276200  0.4525300  0.09933191
##   0.10       1                   300     0.1270397  0.4563182  0.09845740
##   0.10       1                   400     0.1267549  0.4590868  0.09799525
##   0.10       1                   500     0.1267871  0.4593649  0.09787300
##   0.10       1                   600     0.1268248  0.4597729  0.09794830
##   0.10       1                   700     0.1269017  0.4598267  0.09785415
##   0.10       1                   800     0.1269643  0.4597786  0.09792842
##   0.10       1                   900     0.1271069  0.4594556  0.09800464
##   0.10       1                  1000     0.1272713  0.4587749  0.09811610
##   0.10       3                   100     0.1185029  0.5291048  0.09160168
##   0.10       3                   200     0.1157220  0.5485529  0.08868396
##   0.10       3                   300     0.1146298  0.5567415  0.08759488
##   0.10       3                   400     0.1141122  0.5607816  0.08699393
##   0.10       3                   500     0.1135867  0.5651645  0.08644747
##   0.10       3                   600     0.1134131  0.5669755  0.08629430
##   0.10       3                   700     0.1133401  0.5679619  0.08611128
##   0.10       3                   800     0.1132176  0.5693100  0.08603937
##   0.10       3                   900     0.1131356  0.5702326  0.08593373
##   0.10       3                  1000     0.1130289  0.5713845  0.08587400
##   0.10       5                   100     0.1139648  0.5633455  0.08728061
##   0.10       5                   200     0.1114119  0.5811478  0.08478597
##   0.10       5                   300     0.1103988  0.5885609  0.08371115
##   0.10       5                   400     0.1098422  0.5928978  0.08302660
##   0.10       5                   500     0.1095411  0.5952367  0.08270386
##   0.10       5                   600     0.1094348  0.5962698  0.08254339
##   0.10       5                   700     0.1093941  0.5967412  0.08252185
##   0.10       5                   800     0.1093075  0.5975452  0.08236832
##   0.10       5                   900     0.1092578  0.5980269  0.08229006
##   0.10       5                  1000     0.1091901  0.5986895  0.08224825
##   0.10       7                   100     0.1111048  0.5848569  0.08439544
##   0.10       7                   200     0.1089841  0.5993402  0.08226450
##   0.10       7                   300     0.1083579  0.6038712  0.08159423
##   0.10       7                   400     0.1081406  0.6056329  0.08130197
##   0.10       7                   500     0.1079491  0.6071446  0.08108709
##   0.10       7                   600     0.1078999  0.6075971  0.08097651
##   0.10       7                   700     0.1079114  0.6076594  0.08097628
##   0.10       7                   800     0.1078946  0.6078988  0.08090430
##   0.10       7                   900     0.1078638  0.6081523  0.08084664
##   0.10       7                  1000     0.1078653  0.6082244  0.08082077
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 900, interaction.depth =
##  7, shrinkage = 0.1 and n.minobsinnode = 10.
ggplot(gbmModel)

Higher shrinkage values seems to provide lower RMSE values and max tree depth which is seven in this case seems to provide the lowest. But likewise higher tree depth can overfit on the training data.

summary(gbmModel,plot = FALSE)
##                                       var     rel.inf
## X.Usage.cont.               X.Usage.cont. 11.00684722
## X.Mnf.Flow.                   X.Mnf.Flow.  9.20679429
## X.Oxygen.Filler.         X.Oxygen.Filler.  6.03420897
## X.Brand.Code.C             X.Brand.Code.C  5.93282193
## Temperature                   Temperature  5.10305867
## X.Pressure.Vacuum.     X.Pressure.Vacuum.  4.12541630
## X.Filler.Speed.           X.Filler.Speed.  3.90244818
## X.Alch.Rel.                   X.Alch.Rel.  3.56061228
## X.Carb.Pressure1.       X.Carb.Pressure1.  3.42065036
## X.Carb.Flow.                 X.Carb.Flow.  3.26527637
## X.Air.Pressurer.         X.Air.Pressurer.  3.25933862
## MFR                                   MFR  2.93403188
## X.Fill.Ounces.             X.Fill.Ounces.  2.71327568
## X.Carb.Rel.                   X.Carb.Rel.  2.69605154
## X.Filler.Level.           X.Filler.Level.  2.53403137
## X.Balling.Lvl.             X.Balling.Lvl.  2.45734564
## X.PC.Volume.                 X.PC.Volume.  2.42046361
## Balling                           Balling  2.36131478
## X.Hyd.Pressure1.         X.Hyd.Pressure1.  2.01120625
## PSC                                   PSC  2.00426391
## Density                           Density  1.98089320
## X.Carb.Volume.             X.Carb.Volume.  1.94137034
## X.PSC.Fill.                   X.PSC.Fill.  1.91382204
## X.Hyd.Pressure4.         X.Hyd.Pressure4.  1.68388005
## X.Carb.Temp.                 X.Carb.Temp.  1.66439535
## X.Bowl.Setpoint.         X.Bowl.Setpoint.  1.59108707
## X.Carb.Pressure.         X.Carb.Pressure.  1.52859473
## X.Fill.Pressure.         X.Fill.Pressure.  1.52067194
## X.Hyd.Pressure3.         X.Hyd.Pressure3.  1.28726665
## X.Hyd.Pressure2.         X.Hyd.Pressure2.  1.20912989
## X.PSC.CO2.                     X.PSC.CO2.  0.98679336
## X.Pressure.Setpoint. X.Pressure.Setpoint.  0.77517582
## X.Brand.Code.A             X.Brand.Code.A  0.69358775
## X.Brand.Code.B             X.Brand.Code.B  0.26162874
## X.Brand.Code.D             X.Brand.Code.D  0.01224519

It seems like Usage cont and Mnf Flow had a strong influence on the formation of the boosted trees, followed by Brand code C and Oxygen Filler.

GBMpred <- predict(gbmModel,df_test)
postResample(GBMpred,df_test$PH)
##       RMSE   Rsquared        MAE 
## 0.09939005 0.64977854 0.07502457

Seems like the boosting model performed slightly worse than the random forest model, but it still performed well on the testing data.

XGBoost

Finally, I want to try out a popular machine learning algorithm aka xgboost, due to the nature of this algorithm I believe this particular technique will work out the best.

## https://www.youtube.com/watch?v=lwv8GneTUOk&ab_channel=DataGarden
## May have to create a new tuning grid for xgboost.. gamma for 5 is suggested if u have no clue what to do look at article for tuning
xgBgrid <- expand.grid(nrounds = 100,eta = c(0.1,0.3),max_depth = c(2,4,6),colsample_bytree = c(0.5,0.9),subsample = c(0.5,0.8),gamma = 5,min_child_weight = c(0,10))
xgBoostModel <- train(PH~.,data = df_train,method = "xgbTree",objective = "reg:squarederror",trControl = ctrl,tuneGrid = xgBgrid)
ggplot(varImp(xgBoostModel))

xgbPred <- predict(xgBoostModel,df_test)
postResample(xgbPred,df_test$PH)
##      RMSE  Rsquared       MAE 
## 0.1512001 0.2734128 0.1197072

I don’t understand what happened here, but xgboost did not do well on my dataset,I am trying to figure out why but for now I am unsure, I will keep this failed model here just in case.

Create The Table and Compare Results

Table <- bind_rows(pls = postResample(plsPred,df_test$PH),pcr = postResample(pcrPred,df_test$PH),NN = postResample(nnetPred,df_test$PH),Mars = postResample(Marspred,df_test$PH),RF = postResample(RFpred,df_test$PH),Boost = postResample(GBMpred,df_test$PH),xgb = postResample(xgbPred,df_test$PH))
Table$id = c("PLS","PCR","Neural Network","MARS",
             "Random Forest","Boosted Trees","xgBoost")
Table
## # A tibble: 7 × 4
##     RMSE Rsquared    MAE id            
##    <dbl>    <dbl>  <dbl> <chr>         
## 1 0.126     0.435 0.0954 PLS           
## 2 0.128     0.421 0.0969 PCR           
## 3 0.114     0.542 0.0880 Neural Network
## 4 0.133     0.422 0.0907 MARS          
## 5 0.0900    0.734 0.0645 Random Forest 
## 6 0.0994    0.650 0.0750 Boosted Trees 
## 7 0.151     0.273 0.120  xgBoost

Comparing all the metrics we can see that Random Forest and Boosted Trees seems to have performed the best on the test data, unfortunately xgBoost did not work in this case and it seem to have failed. Even though this is the algorithm that is popular among many data scientists.


Test -Set

I will now apply my models on the test set from the other excel file and then import them into an excel file.

But first, we will have to read and clean the data

Test <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentEvaluation.xlsx")

Test
## # A tibble: 267 × 33
##    `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 D                     5.48          24.0       0.27             65.4
##  2 A                     5.39          24.0       0.227            63.2
##  3 B                     5.29          23.9       0.303            66.4
##  4 B                     5.27          23.9       0.186            64.8
##  5 B                     5.41          24.2       0.16             69.4
##  6 B                     5.29          24.1       0.212            73.4
##  7 A                     5.48          23.9       0.243            65.2
##  8 B                     5.42          24.1       0.123            67.4
##  9 A                     5.41          23.9       0.333            66.8
## 10 D                     5.47          24.0       0.256            72.6
## # ℹ 257 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …

Now I have to perform the same data-cleaning and preprocessing procedures,

str(Test)
## tibble [267 × 33] (S3: tbl_df/tbl/data.frame)
##  $ Brand Code       : chr [1:267] "D" "A" "B" "B" ...
##  $ Carb Volume      : num [1:267] 5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill Ounces      : num [1:267] 24 24 23.9 23.9 24.2 ...
##  $ PC Volume        : num [1:267] 0.27 0.227 0.303 0.186 0.16 ...
##  $ Carb Pressure    : num [1:267] 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
##  $ Carb Temp        : num [1:267] 135 135 140 139 142 ...
##  $ PSC              : num [1:267] 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC Fill         : num [1:267] 0.4 0.22 0.1 0.2 0.3 ...
##  $ PSC CO2          : num [1:267] 0.04 0.08 0.02 0.02 0.06 ...
##  $ Mnf Flow         : num [1:267] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb Pressure1   : num [1:267] 117 119 120 125 115 ...
##  $ Fill Pressure    : num [1:267] 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
##  $ Hyd Pressure1    : num [1:267] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure2    : num [1:267] NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure3    : num [1:267] NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd Pressure4    : num [1:267] 96 112 98 132 94 94 108 108 110 106 ...
##  $ Filler Level     : num [1:267] 129 120 119 120 116 ...
##  $ Filler Speed     : num [1:267] 3986 4012 4010 NA 4018 ...
##  $ Temperature      : num [1:267] 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
##  $ Usage cont       : num [1:267] 21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb Flow        : num [1:267] 2950 2916 3056 28 3214 ...
##  $ Density          : num [1:267] 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
##  $ MFR              : num [1:267] 728 736 735 NA 752 ...
##  $ Balling          : num [1:267] 1.4 2.94 1.45 1.06 1.4 ...
##  $ Pressure Vacuum  : num [1:267] -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
##  $ PH               : logi [1:267] NA NA NA NA NA NA ...
##  $ Oxygen Filler    : num [1:267] 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl Setpoint    : num [1:267] 130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure Setpoint: num [1:267] 45.2 46 46 46 50 46 46 46 46 46 ...
##  $ Air Pressurer    : num [1:267] 143 147 147 146 146 ...
##  $ Alch Rel         : num [1:267] 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb Rel         : num [1:267] 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling Lvl      : num [1:267] 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...

Glancinng at the test data, it more or less resembles the structure of the training set, and thus I will clean and preprocess the data the same way.

Clean/Preprocess The Data

# Training3 is going to be the final cleaned up version 
Test2 <- Test

Should preprocess the data into training/testing before I actually applied it to the training set.

# Create the dummy variables using predict and the Y variable (PH) will not be available in the trainingset
library(caret)
dummy_mod1 <- dummyVars(PH~.,data = Test2)
Test2Dat_Mat <- predict(dummy_mod1,newdata = Test2)

## Convert it to a data-frame and this is the final Training set. 
TestTwo <- data.frame(Test2Dat_Mat)


## I will remove the NA values from the test set..
TestTwo <- TestTwo %>%
  na.omit()

Test Model on Cleaned Test Data

I tested my models before with NA values but the model seems to spit out results that only has columns with values in it.

PlsTest <- predict(plsTune,TestTwo)
pcRTest <- predict(pcrTune,TestTwo)
NeuralTest <- predict(nnetModel,TestTwo)
MarsTest <- predict(Marsmodel,TestTwo)
RFTest <- predict(RandomF,TestTwo)
BoostTest <- predict(gbmModel,TestTwo)
xgBoostTest <- predict(xgBoostModel,TestTwo)
TestTable <- bind_cols("PLS" = PlsTest,"PCR" = pcRTest,"Neural Network" = NeuralTest,"MARS" = MarsTest,"Random Forest" = RFTest,"Boosted Tree" = BoostTest,"xgBoost" = xgBoostTest)

TestTable
## # A tibble: 200 × 7
##      PLS   PCR `Neural Network` MARS[,"y"] `Random Forest` `Boosted Tree`
##    <dbl> <dbl>            <dbl>      <dbl>           <dbl>          <dbl>
##  1  8.61  8.62             8.42       8.58            8.50           8.46
##  2  8.62  8.62             8.51       8.56            8.54           8.51
##  3  8.50  8.51             8.49       8.54            8.51           8.44
##  4  8.57  8.56             8.48       8.50            8.48           8.51
##  5  8.69  8.67             8.55       8.57            8.56           8.67
##  6  8.60  8.62             8.42       8.56            8.54           8.49
##  7  8.60  8.60             8.54       8.60            8.54           8.58
##  8  8.62  8.62             8.76       8.70            8.60           8.65
##  9  8.37  8.39             8.28       8.19            8.26           8.20
## 10  8.60  8.61             8.60       8.57            8.58           8.55
## # ℹ 190 more rows
## # ℹ 1 more variable: xgBoost <dbl>

Write to Excel

library(writexl)
## Warning: package 'writexl' was built under R version 4.3.3
write_xlsx(TestTable,"C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\Finalpredictions.xlsx")

Thoughts on Process

After cleaning and creating a variety of models from this data set, it seems like using Random Forest and Boosted Trees are the best options to use on the test set since they performed really well in predicting PH values. In a business setting, if we took a look at the variable importance when constructing the tree, we note that in the randomf forest MnF Flow and Usage Cont and Brand Code C seems to be important predictors when predicting PH values, These same predictors were also found when constructing the Boosted Trees. While there is not enough domain knowledge in hand, I would assume that the PH values were possibly influenced by how Brand C measured their PH?, and how the other key predictors has had an important hand in their PH values. Either way I would possibly tell stakeholders or the higher-ups that we should take careful consideration on how we handle Mnf Flow and Usage Cont and to possibly learn more on how Brand Code C handles their Ph measurements or values. Note I am really dissapointed with how my xgboost model turned out I was really hoping it would outperform the random forest and the boosted trees but I wanted to keep this model in to showcase that not everything goes right in these sort of machine-learning workflow. Testing the models on the actual testing set, I noticed that if there were any Na values in anyone of the columns the model would seem to ignore giving the Ph values for those particular observations, hance why the predictions in the excel file only contains 200 predictions instead of the usual 267. Ultimately, I would choose the Random Forest model because it performed really well on the training data and the Rsquared and RMSE metrics are good.