For this second project I am going to build a set of models using the caret library, since I am interested in trying different tuning values across the many kinds of models I want to fit to this dataset. For the resampling method I stuck with a simple 10-fold cross-validation, since the dataset is rather small and cross-validation is a safe default. I will train several models, see how well each performs, and from there simply choose the model with the best metrics.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Training <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentData.xlsx")
head(Training)
## # A tibble: 6 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # `Pressure Vacuum` <dbl>, PH <dbl>, `Oxygen Filler` <dbl>, …
str(Training)
## tibble [2,571 × 33] (S3: tbl_df/tbl/data.frame)
## $ Brand Code : chr [1:2571] "B" "A" "B" "A" ...
## $ Carb Volume : num [1:2571] 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill Ounces : num [1:2571] 24 24 24.1 24 24.3 ...
## $ PC Volume : num [1:2571] 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb Pressure : num [1:2571] 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb Temp : num [1:2571] 141 140 145 133 137 ...
## $ PSC : num [1:2571] 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC Fill : num [1:2571] 0.26 0.22 0.34 0.42 0.16 ...
## $ PSC CO2 : num [1:2571] 0.04 0.04 0.16 0.04 0.12 ...
## $ Mnf Flow : num [1:2571] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num [1:2571] 119 122 120 115 118 ...
## $ Fill Pressure : num [1:2571] 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd Pressure1 : num [1:2571] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num [1:2571] NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num [1:2571] 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler Level : num [1:2571] 121 119 120 118 119 ...
## $ Filler Speed : num [1:2571] 4002 3986 4020 4012 4010 ...
## $ Temperature : num [1:2571] 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage cont : num [1:2571] 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb Flow : num [1:2571] 2932 3144 2914 3062 3054 ...
## $ Density : num [1:2571] 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num [1:2571] 725 727 735 731 723 ...
## $ Balling : num [1:2571] 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure Vacuum : num [1:2571] -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num [1:2571] 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen Filler : num [1:2571] 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl Setpoint : num [1:2571] 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num [1:2571] 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air Pressurer : num [1:2571] 143 143 142 146 146 ...
## $ Alch Rel : num [1:2571] 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb Rel : num [1:2571] 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling Lvl : num [1:2571] 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
Glancing at the structure of the data, we notice that there are a few missing values, and some predictors potentially take on mostly the same value. Also, is there any significance to the brand code?
I figured I might as well remove the NA values, since we don’t lose that many observations within the training set. Looking at the training set, the missing values appear to be scattered at random, with no fixed pattern of missingness.
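As a quick check on that claim, here is a minimal sketch tallying the NA counts per column:
# Count missing values per column and sort by the worst offenders
Training %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  arrange(desc(n_missing))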
Training1 <- Training %>%
na.omit()
Training1 <- Training1 %>%
select_if(is.numeric)
Training1 <- Training1 %>%
pivot_longer(colnames(Training1)) %>%
as.data.frame()
ggplot(Training1,aes(x = value)) +
geom_histogram() +
facet_wrap(~name,scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Some of the columns look skewed, while others are sparse, and a few look normally distributed with slight skew. Hyd Pressure1, 2, and 3 appear to have a lot of zeros in their distributions, which is worrisome; those columns seem to hold sparse data.
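To put numbers on that visual impression, per-column skewness could be computed (a sketch; it assumes the e1071 package is installed):
# Rank numeric columns by absolute skewness
library(e1071)
Training %>%
  select_if(is.numeric) %>%
  summarise(across(everything(), ~ e1071::skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))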
Training %>%
filter(!is.na(`Brand Code`)) %>%
ggplot(aes(x = `Brand Code`)) +
geom_bar() + labs(title = "Proportion of Brand Code Within The Training Set")
It appears most of the Brand Codes are B, followed by D; A and C seem to occur in roughly equal proportion.
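A quick count puts numbers on those proportions (a sketch):
# Tabulate brand codes and their shares of the non-missing rows
Training %>%
  filter(!is.na(`Brand Code`)) %>%
  count(`Brand Code`) %>%
  mutate(prop = n / sum(n))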
Let’s take a look at the correlation matrix and see what stands out.
## The correlation matrix only works when there are no missing values and all predictors are numeric.
Training2 <- Training %>%
select_if(is.numeric) %>%
na.omit()
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Training2),method = "circle")
Looking at the correlation plot, we can see that some predictors are strongly correlated with others; for example, Carb Volume is correlated with Density, Balling, Alch Rel, and so on. Interestingly, PH (our response variable) doesn’t show a strong correlation with any predictor, only weak negative correlations.
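To make that concrete, here is a sketch listing the predictor pairs with the strongest correlations (the 0.8 cutoff is my arbitrary choice):
# Extract the upper-triangle pairs whose correlation exceeds 0.8 in magnitude
corMat <- cor(Training2)
highIdx <- which(abs(corMat) > 0.8 & upper.tri(corMat), arr.ind = TRUE)
data.frame(var1 = rownames(corMat)[highIdx[, 1]],
           var2 = colnames(corMat)[highIdx[, 2]],
           r = round(corMat[highIdx], 2))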
Since the Brand Code column is essentially a character column, I figure I may as well one-hot encode it: it seems wise to preserve as many of the variables as I can, and most ML algorithms require numeric columns.
# Training3 is going to be the final cleaned up version
Training3 <- Training %>%
na.omit()
I should preprocess the data and create the training/testing split before anything gets applied to the final evaluation set.
# Create the dummy variables via dummyVars/predict; note the Y variable (PH) is dropped from the encoded output
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
dummy_mod <- dummyVars(PH~.,data = Training3)
Training3Dat_Mat <- predict(dummy_mod,newdata = Training3)
## Convert it to a data frame; this is the final training set.
TrainingThree <- data.frame(Training3Dat_Mat)
# Re-attach PH, which dummyVars dropped, so I can create the training and testing splits.
TrainingThree <- cbind(TrainingThree,Training3$PH)
## Rename the `Training3$PH` column to just PH
TrainingThree <- TrainingThree %>%
rename(PH = `Training3$PH`)
Here we will create, train, and tune a series of models with the data I have cleaned.
library(caret)
## Let's fit a PLS model..
set.seed(15)
# We will use simple 10-fold cross-validation for resampling, and create the training and testing sets here.
ctrl <- trainControl(method = "cv",number = 10)
train_indicies <- createDataPartition(TrainingThree$PH,p = 0.8,list = FALSE)
df_train <- TrainingThree[train_indicies,]
df_test <- TrainingThree[-train_indicies,]
set.seed(100)
plsTune <- train(PH~.,data = df_train,method = "pls",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
plsTune
## Partial Least Squares
##
## 1631 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1468900 0.2676190 0.1163011
## 2 0.1377369 0.3572221 0.1079962
## 3 0.1348145 0.3841050 0.1069736
## 4 0.1328561 0.4027399 0.1044903
## 5 0.1321517 0.4083418 0.1042786
## 6 0.1314303 0.4146814 0.1030792
## 7 0.1309716 0.4187563 0.1028515
## 8 0.1307266 0.4206853 0.1024522
## 9 0.1306061 0.4214061 0.1022479
## 10 0.1306310 0.4213306 0.1021473
## 11 0.1305677 0.4220748 0.1020323
## 12 0.1305373 0.4224209 0.1019663
## 13 0.1305242 0.4225067 0.1019982
## 14 0.1305053 0.4226692 0.1019040
## 15 0.1305952 0.4218776 0.1019349
## 16 0.1305276 0.4225688 0.1018329
## 17 0.1304762 0.4230303 0.1018234
## 18 0.1303858 0.4240251 0.1016606
## 19 0.1303051 0.4245581 0.1016329
## 20 0.1302389 0.4250852 0.1014709
## 21 0.1301316 0.4259696 0.1012358
## 22 0.1300326 0.4268484 0.1011696
## 23 0.1299492 0.4276008 0.1011171
## 24 0.1298814 0.4282063 0.1009655
## 25 0.1297590 0.4292651 0.1009255
## 26 0.1297082 0.4295870 0.1009541
## 27 0.1296893 0.4297161 0.1009171
## 28 0.1297087 0.4295524 0.1009449
## 29 0.1297126 0.4295741 0.1009660
## 30 0.1297195 0.4295478 0.1009338
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.
ggplot(plsTune)
The RMSE curve seems to flatten out over the last few components.
summary(plsTune)
## Data: X dimension: 1631 35
## Y dimension: 1631 1
## Fit method: oscorespls
## Number of components considered: 27
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 17.03 28.07 43.61 46.97 51.20 53.93 56.90
## .outcome 27.08 36.39 39.24 41.47 42.26 42.98 43.38
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 59.52 62.88 66.65 70.04 72.60 75.34 77.75
## .outcome 43.62 43.74 43.81 43.87 43.93 43.97 44.01
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
## X 79.57 81.45 83.69 85.19 86.68 88.69 90.06
## .outcome 44.12 44.27 44.36 44.44 44.49 44.53 44.62
## 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## X 91.23 92.49 93.74 95.10 96.15 96.74
## .outcome 44.71 44.74 44.76 44.79 44.82 44.84
plsPred <- predict(plsTune,df_test)
postResample(plsPred,df_test$PH)
## RMSE Rsquared MAE
## 0.12603288 0.43544849 0.09541264
The PLS model performs reasonably on the test set: the R-squared is moderate (about 0.44) and the RMSE fairly low.
Let’s attempt a principal component regression (PCR) model with the caret package.
set.seed(100)
pcrTune <- train(PH~.,data = df_train,method = "pcr",tuneLength = 30,trControl = ctrl,preProc = c("center","scale"))
pcrTune
## Principal Component Analysis
##
## 1631 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1695959 0.02526793 0.1357616
## 2 0.1557945 0.17618815 0.1233237
## 3 0.1556513 0.17715416 0.1233658
## 4 0.1543764 0.19027955 0.1227607
## 5 0.1506671 0.22903567 0.1194065
## 6 0.1450333 0.28605439 0.1154998
## 7 0.1415528 0.32023102 0.1129494
## 8 0.1411020 0.32463825 0.1125569
## 9 0.1406065 0.33003830 0.1120909
## 10 0.1403551 0.33245317 0.1119522
## 11 0.1401127 0.33495508 0.1115769
## 12 0.1394612 0.34050381 0.1109875
## 13 0.1393994 0.34104795 0.1108544
## 14 0.1394052 0.34122174 0.1108851
## 15 0.1388640 0.34615077 0.1106853
## 16 0.1377597 0.35596527 0.1090866
## 17 0.1358703 0.37427317 0.1074072
## 18 0.1354931 0.37779555 0.1069131
## 19 0.1354720 0.37833940 0.1069085
## 20 0.1340224 0.39234136 0.1059181
## 21 0.1340602 0.39183191 0.1059894
## 22 0.1341623 0.39091084 0.1060099
## 23 0.1338306 0.39368347 0.1059302
## 24 0.1333390 0.39841794 0.1054552
## 25 0.1323154 0.40686339 0.1044094
## 26 0.1323778 0.40635365 0.1044544
## 27 0.1305801 0.42197930 0.1021193
## 28 0.1305891 0.42186560 0.1021181
## 29 0.1306906 0.42096645 0.1022790
## 30 0.1308241 0.41985419 0.1022839
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 27.
ggplot(pcrTune)
It seems the lowest RMSE was at the 27th component.
pcrPred <- predict(pcrTune,df_test)
postResample(pcrPred,df_test$PH)
## RMSE Rsquared MAE
## 0.12769757 0.42112115 0.09685311
The PCR seems little different from the PLS regression, though it performed slightly worse on the test split.
Let’s try to incorporate neural networks and MARS in this section.
Neural networks have a tendency to overfit the relationship between the predictors and the response due to their large number of coefficients; this can be mitigated with early stopping and weight decay. The decay value lambda should sit between 0 and 0.1, and the predictors should be centered and scaled. I will also use model averaging, since it seems to provide better results according to the textbook.
## Find highly correlated columns and remove them from the data
tooCorrelated <- findCorrelation(cor(df_train),cutoff = .75)
trainXnet <- df_train[,-tooCorrelated]
testXnet <- df_test[,-tooCorrelated]
Following the textbook, I removed highly correlated predictors from my training set before feeding it into the neural network.
## Let's create a grid of candidate tuning parameters to evaluate
nnetGrid <- expand.grid(.decay = c(0,0.01,0.1),.size = c(1:5),.bag = FALSE)
## Let's train a neural network model and tweak the settings. I kept the tuning grid small since the neural network takes a while to train
set.seed(100)
library(doParallel)
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
nnetModel <- train(PH~.,data = trainXnet,method = "avNNet",tuneGrid = nnetGrid,trControl = ctrl,preProc = c("center","scale"),linout = TRUE,trace = FALSE,MaxNWts = 5 * (ncol(trainXnet)+ 1) + 5 + 1,maxit = 250)
## Warning: executing %dopar% sequentially: no parallel backend registered
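The warning above appeared because doParallel was loaded without registering a backend, so the resampling still ran sequentially. Registering one takes a couple of lines (a minimal sketch; the worker count of 4 is an assumption about the machine):
cl <- makePSOCKcluster(4)  # assumed 4 workers; match your core count
registerDoParallel(cl)
## any train() calls made while the cluster is registered resample in parallel
stopCluster(cl)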
nnetModel
## Model Averaged Neural Network
##
## 1631 samples
## 23 predictor
##
## Pre-processing: centered (23), scaled (23)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.1364979 0.3684368 0.10803184
## 0.00 2 0.1320085 0.4203298 0.10361097
## 0.00 3 0.1675585 0.3552124 0.10658585
## 0.00 4 0.1538716 0.4045627 0.09956678
## 0.00 5 0.2402179 0.4938463 0.10758527
## 0.01 1 0.1365235 0.3680112 0.10820210
## 0.01 2 0.1281816 0.4424541 0.10157958
## 0.01 3 0.1222415 0.4935821 0.09608904
## 0.01 4 0.1200936 0.5113575 0.09429327
## 0.01 5 0.1155865 0.5459104 0.08993168
## 0.10 1 0.1374366 0.3605308 0.10911843
## 0.10 2 0.1270508 0.4532412 0.09985821
## 0.10 3 0.1224608 0.4918372 0.09620079
## 0.10 4 0.1195943 0.5157582 0.09308668
## 0.10 5 0.1167379 0.5380169 0.09006987
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.01 and bag = FALSE.
ggplot(nnetModel)
Looking at the plot, the two nonzero weight decays track each other closely, with 0.01 winning at five hidden units.
nnetPred <- predict(nnetModel,testXnet)
postResample(nnetPred,testXnet$PH)
## RMSE Rsquared MAE
## 0.11356806 0.54208364 0.08800004
The neural network seems to perform well with the current parameter settings, and it beat both the PLS and PCR models.
Let’s create a MARS model now. There are two tuning parameters: the degree of the features added to the model, and the number of retained terms (nprune); the latter can also be determined by the default GCV pruning procedure (see the model summary on page 149 of the textbook).
# Stick with the same resampling and use nprune values of 2:30 to see how many terms get retained.
set.seed(100)
Marsgrid <- expand.grid(degree = 1:3,nprune = 2:30)
Marsmodel <- train(PH~.,data = df_train,method = "earth",tuneGrid = Marsgrid,trControl = ctrl)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.3.2
ggplot(Marsmodel)
It seems that higher degrees and larger nprune values capture most of the variance and produce the lowest RMSE values.
Marspred <- predict(Marsmodel,df_test)
postResample(Marspred,df_test$PH)
## RMSE Rsquared MAE
## 0.13270590 0.42236421 0.09069764
It seems the MARS model performed worse than the neural network, and I wonder why; was it because of the one-hot encoding used on the brand code?
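One way to probe that question is to inspect which terms the final MARS model actually retained; if the brand dummies rarely appear among the hinge functions, the encoding is probably not the culprit (a sketch using the fitted earth object):
## List the hinge terms kept by the final earth model
summary(Marsmodel$finalModel)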
Next, let’s create random forest and boosting models.
Let’s try a random forest model first; in caret’s rf method, the main parameter to tune is mtry, the number of predictors randomly sampled at each split.
## Breiman suggests setting mtry to roughly 1/3 of the total predictors; start with 5 candidate values of mtry, and we will use a small number of trees (pg 200)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(100)
rfGrid <- expand.grid(mtry = c(2,4,6,8,10))
RandomF <- train(PH~.,data = df_train,trControl = ctrl,tuneGrid = rfGrid,method = "rf")
RandomF
## Random Forest
##
## 1631 samples
## 35 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1467, 1468, 1468, 1467, 1468, 1469, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.10740741 0.6551531 0.08186186
## 4 0.10073141 0.6874531 0.07543571
## 6 0.09862468 0.6943415 0.07337440
## 8 0.09739290 0.6984954 0.07201076
## 10 0.09687985 0.6983824 0.07157267
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
ggplot(RandomF)
It seems that using between eight and ten predictors per split makes a slight further reduction in the RMSE values.
ggplot(varImp(RandomF,scale = TRUE))
It seems Mnf Flow, Usage cont, and Brand Code C were important predictors in constructing the random forest model.
RFpred <- predict(RandomF,df_test)
postResample(RFpred,df_test$PH)
## RMSE Rsquared MAE
## 0.08998589 0.73377471 0.06453793
Not surprisingly, the random forest seems to perform really well on the test data; I am going to assume that boosted trees and xgboost will perform even better on this type of data.
Let’s try to create boosted trees now.
## Tune interaction depths from 1 to 7, tree counts from 100 to 1000 in steps of 100, and the two shrinkage values noted in the textbook. n.minobsinnode is the integer specifying the minimum number of observations in a tree's terminal nodes
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
.n.trees = seq(100, 1000, by = 100),
.shrinkage = c(0.01, 0.1),.n.minobsinnode = 10)
set.seed(100)
gbmModel <- train(PH~.,data = df_train,tuneGrid = gbmGrid,method = "gbm",verbose = FALSE)
gbmModel
## Stochastic Gradient Boosting
##
## 1631 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1631, 1631, 1631, 1631, 1631, 1631, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 100 0.1526692 0.3110343 0.12136449
## 0.01 1 200 0.1440630 0.3570009 0.11403388
## 0.01 1 300 0.1395674 0.3787609 0.11037140
## 0.01 1 400 0.1369548 0.3921618 0.10826390
## 0.01 1 500 0.1352405 0.4023632 0.10680884
## 0.01 1 600 0.1338960 0.4110923 0.10567195
## 0.01 1 700 0.1328554 0.4179524 0.10478901
## 0.01 1 800 0.1319617 0.4241548 0.10401176
## 0.01 1 900 0.1312313 0.4290510 0.10334544
## 0.01 1 1000 0.1306156 0.4334061 0.10277306
## 0.01 3 100 0.1417338 0.4165970 0.11255890
## 0.01 3 200 0.1318490 0.4520906 0.10427643
## 0.01 3 300 0.1274802 0.4751170 0.10047647
## 0.01 3 400 0.1248023 0.4916002 0.09806268
## 0.01 3 500 0.1228341 0.5042145 0.09622447
## 0.01 3 600 0.1213781 0.5134974 0.09482033
## 0.01 3 700 0.1202058 0.5211239 0.09364531
## 0.01 3 800 0.1192501 0.5272987 0.09269604
## 0.01 3 900 0.1184998 0.5321027 0.09190949
## 0.01 3 1000 0.1178344 0.5364405 0.09121388
## 0.01 5 100 0.1376373 0.4588917 0.10921212
## 0.01 5 200 0.1267420 0.4962076 0.09983185
## 0.01 5 300 0.1219315 0.5199665 0.09549226
## 0.01 5 400 0.1190783 0.5362044 0.09275528
## 0.01 5 500 0.1172812 0.5464077 0.09096722
## 0.01 5 600 0.1159004 0.5547584 0.08954558
## 0.01 5 700 0.1148201 0.5615465 0.08842717
## 0.01 5 800 0.1140210 0.5665256 0.08757618
## 0.01 5 900 0.1133441 0.5707987 0.08685144
## 0.01 5 1000 0.1127675 0.5744251 0.08627229
## 0.01 7 100 0.1348587 0.4895175 0.10697044
## 0.01 7 200 0.1231846 0.5262075 0.09664760
## 0.01 7 300 0.1180758 0.5498012 0.09183438
## 0.01 7 400 0.1152721 0.5643821 0.08907335
## 0.01 7 500 0.1134272 0.5749298 0.08723523
## 0.01 7 600 0.1121518 0.5824252 0.08595086
## 0.01 7 700 0.1112109 0.5880469 0.08496070
## 0.01 7 800 0.1104661 0.5925434 0.08422219
## 0.01 7 900 0.1098496 0.5963808 0.08359064
## 0.01 7 1000 0.1093491 0.5994872 0.08310536
## 0.10 1 100 0.1305099 0.4333978 0.10257809
## 0.10 1 200 0.1276200 0.4525300 0.09933191
## 0.10 1 300 0.1270397 0.4563182 0.09845740
## 0.10 1 400 0.1267549 0.4590868 0.09799525
## 0.10 1 500 0.1267871 0.4593649 0.09787300
## 0.10 1 600 0.1268248 0.4597729 0.09794830
## 0.10 1 700 0.1269017 0.4598267 0.09785415
## 0.10 1 800 0.1269643 0.4597786 0.09792842
## 0.10 1 900 0.1271069 0.4594556 0.09800464
## 0.10 1 1000 0.1272713 0.4587749 0.09811610
## 0.10 3 100 0.1185029 0.5291048 0.09160168
## 0.10 3 200 0.1157220 0.5485529 0.08868396
## 0.10 3 300 0.1146298 0.5567415 0.08759488
## 0.10 3 400 0.1141122 0.5607816 0.08699393
## 0.10 3 500 0.1135867 0.5651645 0.08644747
## 0.10 3 600 0.1134131 0.5669755 0.08629430
## 0.10 3 700 0.1133401 0.5679619 0.08611128
## 0.10 3 800 0.1132176 0.5693100 0.08603937
## 0.10 3 900 0.1131356 0.5702326 0.08593373
## 0.10 3 1000 0.1130289 0.5713845 0.08587400
## 0.10 5 100 0.1139648 0.5633455 0.08728061
## 0.10 5 200 0.1114119 0.5811478 0.08478597
## 0.10 5 300 0.1103988 0.5885609 0.08371115
## 0.10 5 400 0.1098422 0.5928978 0.08302660
## 0.10 5 500 0.1095411 0.5952367 0.08270386
## 0.10 5 600 0.1094348 0.5962698 0.08254339
## 0.10 5 700 0.1093941 0.5967412 0.08252185
## 0.10 5 800 0.1093075 0.5975452 0.08236832
## 0.10 5 900 0.1092578 0.5980269 0.08229006
## 0.10 5 1000 0.1091901 0.5986895 0.08224825
## 0.10 7 100 0.1111048 0.5848569 0.08439544
## 0.10 7 200 0.1089841 0.5993402 0.08226450
## 0.10 7 300 0.1083579 0.6038712 0.08159423
## 0.10 7 400 0.1081406 0.6056329 0.08130197
## 0.10 7 500 0.1079491 0.6071446 0.08108709
## 0.10 7 600 0.1078999 0.6075971 0.08097651
## 0.10 7 700 0.1079114 0.6076594 0.08097628
## 0.10 7 800 0.1078946 0.6078988 0.08090430
## 0.10 7 900 0.1078638 0.6081523 0.08084664
## 0.10 7 1000 0.1078653 0.6082244 0.08082077
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 900, interaction.depth =
## 7, shrinkage = 0.1 and n.minobsinnode = 10.
ggplot(gbmModel)
Higher shrinkage values seem to give lower RMSE here, and the maximum tree depth (seven in this case) gives the lowest; likewise, though, a higher tree depth can overfit the training data. (Note that this model was resampled with caret's default 25 bootstrap reps, since trControl was not passed to train.)
summary(gbmModel,plot = FALSE)
## var rel.inf
## X.Usage.cont. X.Usage.cont. 11.00684722
## X.Mnf.Flow. X.Mnf.Flow. 9.20679429
## X.Oxygen.Filler. X.Oxygen.Filler. 6.03420897
## X.Brand.Code.C X.Brand.Code.C 5.93282193
## Temperature Temperature 5.10305867
## X.Pressure.Vacuum. X.Pressure.Vacuum. 4.12541630
## X.Filler.Speed. X.Filler.Speed. 3.90244818
## X.Alch.Rel. X.Alch.Rel. 3.56061228
## X.Carb.Pressure1. X.Carb.Pressure1. 3.42065036
## X.Carb.Flow. X.Carb.Flow. 3.26527637
## X.Air.Pressurer. X.Air.Pressurer. 3.25933862
## MFR MFR 2.93403188
## X.Fill.Ounces. X.Fill.Ounces. 2.71327568
## X.Carb.Rel. X.Carb.Rel. 2.69605154
## X.Filler.Level. X.Filler.Level. 2.53403137
## X.Balling.Lvl. X.Balling.Lvl. 2.45734564
## X.PC.Volume. X.PC.Volume. 2.42046361
## Balling Balling 2.36131478
## X.Hyd.Pressure1. X.Hyd.Pressure1. 2.01120625
## PSC PSC 2.00426391
## Density Density 1.98089320
## X.Carb.Volume. X.Carb.Volume. 1.94137034
## X.PSC.Fill. X.PSC.Fill. 1.91382204
## X.Hyd.Pressure4. X.Hyd.Pressure4. 1.68388005
## X.Carb.Temp. X.Carb.Temp. 1.66439535
## X.Bowl.Setpoint. X.Bowl.Setpoint. 1.59108707
## X.Carb.Pressure. X.Carb.Pressure. 1.52859473
## X.Fill.Pressure. X.Fill.Pressure. 1.52067194
## X.Hyd.Pressure3. X.Hyd.Pressure3. 1.28726665
## X.Hyd.Pressure2. X.Hyd.Pressure2. 1.20912989
## X.PSC.CO2. X.PSC.CO2. 0.98679336
## X.Pressure.Setpoint. X.Pressure.Setpoint. 0.77517582
## X.Brand.Code.A X.Brand.Code.A 0.69358775
## X.Brand.Code.B X.Brand.Code.B 0.26162874
## X.Brand.Code.D X.Brand.Code.D 0.01224519
It seems Usage cont and Mnf Flow had the strongest influence on the formation of the boosted trees, followed by Oxygen Filler and Brand Code C.
GBMpred <- predict(gbmModel,df_test)
postResample(GBMpred,df_test$PH)
## RMSE Rsquared MAE
## 0.09939005 0.64977854 0.07502457
It seems the boosting model performed slightly worse than the random forest, but it still did well on the testing data.
Finally, I want to try out a popular machine learning algorithm, xgboost; given the nature of this algorithm, I believe this particular technique will work out best.
## https://www.youtube.com/watch?v=lwv8GneTUOk&ab_channel=DataGarden
## xgboost needs its own tuning grid; gamma = 5 was suggested, and the linked video covers tuning if you have no idea where to start
xgBgrid <- expand.grid(nrounds = 100,eta = c(0.1,0.3),max_depth = c(2,4,6),colsample_bytree = c(0.5,0.9),subsample = c(0.5,0.8),gamma = 5,min_child_weight = c(0,10))
xgBoostModel <- train(PH~.,data = df_train,method = "xgbTree",objective = "reg:squarederror",trControl = ctrl,tuneGrid = xgBgrid)
ggplot(varImp(xgBoostModel))
xgbPred <- predict(xgBoostModel,df_test)
postResample(xgbPred,df_test$PH)
## RMSE Rsquared MAE
## 0.1512001 0.2734128 0.1197072
I don’t fully understand what happened here: xgboost did not do well on my dataset. One suspicion is the fixed gamma = 5; PH varies on a small scale, so the loss reduction offered by any single split is tiny, and such a large minimum-gain threshold might be suppressing many otherwise useful splits. For now I am unsure, and I will keep this failed model here just in case.
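If I revisit this model, the first experiment would be retuning with gamma = 0 (a sketch, not run here; the rest of the grid is a guess):
# Same setup as before, but with no minimum-gain penalty
xgBgrid2 <- expand.grid(nrounds = c(100,300),eta = c(0.1,0.3),max_depth = c(2,4,6),
                        colsample_bytree = c(0.5,0.9),subsample = c(0.5,0.8),
                        gamma = 0,min_child_weight = c(1,10))
xgBoostModel2 <- train(PH~.,data = df_train,method = "xgbTree",
                       objective = "reg:squarederror",trControl = ctrl,tuneGrid = xgBgrid2)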
Table <- bind_rows(pls = postResample(plsPred,df_test$PH),pcr = postResample(pcrPred,df_test$PH),NN = postResample(nnetPred,df_test$PH),Mars = postResample(Marspred,df_test$PH),RF = postResample(RFpred,df_test$PH),Boost = postResample(GBMpred,df_test$PH),xgb = postResample(xgbPred,df_test$PH))
Table$id = c("PLS","PCR","Neural Network","MARS",
"Random Forest","Boosted Trees","xgBoost")
Table
## # A tibble: 7 × 4
## RMSE Rsquared MAE id
## <dbl> <dbl> <dbl> <chr>
## 1 0.126 0.435 0.0954 PLS
## 2 0.128 0.421 0.0969 PCR
## 3 0.114 0.542 0.0880 Neural Network
## 4 0.133 0.422 0.0907 MARS
## 5 0.0900 0.734 0.0645 Random Forest
## 6 0.0994 0.650 0.0750 Boosted Trees
## 7 0.151 0.273 0.120 xgBoost
Comparing all the metrics, we can see that Random Forest and Boosted Trees performed the best on the test data; unfortunately, xgBoost did not work in this case and seems to have failed, even though it is an algorithm popular among many data scientists.
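As a complement to this single test-split comparison, caret's resamples() can compare models across their cross-validation folds (a sketch; the gbm model is excluded because it was resampled with bootstrapping rather than the shared 10-fold setup):
# Summarize RMSE/Rsquared/MAE across the shared 10-fold resamples
cvResults <- resamples(list(PLS = plsTune,PCR = pcrTune,NN = nnetModel,
                            MARS = Marsmodel,RF = RandomF,XGB = xgBoostModel))
summary(cvResults)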
I will now apply my models to the evaluation set from the other Excel file and then export the predictions to an Excel file.
But first, we will have to read and clean the data.
Test <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentEvaluation.xlsx")
Test
## # A tibble: 267 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 D 5.48 24.0 0.27 65.4
## 2 A 5.39 24.0 0.227 63.2
## 3 B 5.29 23.9 0.303 66.4
## 4 B 5.27 23.9 0.186 64.8
## 5 B 5.41 24.2 0.16 69.4
## 6 B 5.29 24.1 0.212 73.4
## 7 A 5.48 23.9 0.243 65.2
## 8 B 5.42 24.1 0.123 67.4
## 9 A 5.41 23.9 0.333 66.8
## 10 D 5.47 24.0 0.256 72.6
## # ℹ 257 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
Now I have to perform the same data-cleaning and preprocessing procedures.
str(Test)
## tibble [267 × 33] (S3: tbl_df/tbl/data.frame)
## $ Brand Code : chr [1:267] "D" "A" "B" "B" ...
## $ Carb Volume : num [1:267] 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill Ounces : num [1:267] 24 24 23.9 23.9 24.2 ...
## $ PC Volume : num [1:267] 0.27 0.227 0.303 0.186 0.16 ...
## $ Carb Pressure : num [1:267] 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
## $ Carb Temp : num [1:267] 135 135 140 139 142 ...
## $ PSC : num [1:267] 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC Fill : num [1:267] 0.4 0.22 0.1 0.2 0.3 ...
## $ PSC CO2 : num [1:267] 0.04 0.08 0.02 0.02 0.06 ...
## $ Mnf Flow : num [1:267] -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb Pressure1 : num [1:267] 117 119 120 125 115 ...
## $ Fill Pressure : num [1:267] 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
## $ Hyd Pressure1 : num [1:267] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure2 : num [1:267] NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure3 : num [1:267] NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd Pressure4 : num [1:267] 96 112 98 132 94 94 108 108 110 106 ...
## $ Filler Level : num [1:267] 129 120 119 120 116 ...
## $ Filler Speed : num [1:267] 3986 4012 4010 NA 4018 ...
## $ Temperature : num [1:267] 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
## $ Usage cont : num [1:267] 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb Flow : num [1:267] 2950 2916 3056 28 3214 ...
## $ Density : num [1:267] 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
## $ MFR : num [1:267] 728 736 735 NA 752 ...
## $ Balling : num [1:267] 1.4 2.94 1.45 1.06 1.4 ...
## $ Pressure Vacuum : num [1:267] -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
## $ PH : logi [1:267] NA NA NA NA NA NA ...
## $ Oxygen Filler : num [1:267] 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl Setpoint : num [1:267] 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure Setpoint: num [1:267] 45.2 46 46 46 50 46 46 46 46 46 ...
## $ Air Pressurer : num [1:267] 143 147 147 146 146 ...
## $ Alch Rel : num [1:267] 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb Rel : num [1:267] 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling Lvl : num [1:267] 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...
Glancing at the test data, it more or less resembles the structure of the training set, so I will clean and preprocess it the same way.
# Test2 is going to be the final cleaned-up version
Test2 <- Test
I will apply the same dummy-variable encoding here that I applied to the training data.
# Create the dummy variables via dummyVars/predict; the PH column (all NA in the evaluation set) is dropped from the encoded output
library(caret)
dummy_mod1 <- dummyVars(PH~.,data = Test2)
Test2Dat_Mat <- predict(dummy_mod1,newdata = Test2)
## Convert it to a data frame; this is the final evaluation set.
TestTwo <- data.frame(Test2Dat_Mat)
## I will remove the NA values from the test set..
TestTwo <- TestTwo %>%
na.omit()
When I tested my models with NA values present, they only returned predictions for the rows with complete values.
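An alternative I did not pursue: impute the missing predictor values instead of dropping rows, so that all 267 observations receive predictions. A minimal sketch with caret's preProcess (medianImpute is my arbitrary choice, and TestAll/imputer are hypothetical names):
TestAll <- data.frame(Test2Dat_Mat)         # the dummy-encoded set before na.omit
imputer <- preProcess(TestAll,method = "medianImpute")
TestImputed <- predict(imputer,TestAll)
RFTestAll <- predict(RandomF,TestImputed)   # predictions for all 267 rows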
PlsTest <- predict(plsTune,TestTwo)
pcRTest <- predict(pcrTune,TestTwo)
NeuralTest <- predict(nnetModel,TestTwo)
MarsTest <- predict(Marsmodel,TestTwo)
RFTest <- predict(RandomF,TestTwo)
BoostTest <- predict(gbmModel,TestTwo)
xgBoostTest <- predict(xgBoostModel,TestTwo)
TestTable <- bind_cols("PLS" = PlsTest,"PCR" = pcRTest,"Neural Network" = NeuralTest,"MARS" = MarsTest,"Random Forest" = RFTest,"Boosted Tree" = BoostTest,"xgBoost" = xgBoostTest)
TestTable
## # A tibble: 200 × 7
## PLS PCR `Neural Network` MARS[,"y"] `Random Forest` `Boosted Tree`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.61 8.62 8.42 8.58 8.50 8.46
## 2 8.62 8.62 8.51 8.56 8.54 8.51
## 3 8.50 8.51 8.49 8.54 8.51 8.44
## 4 8.57 8.56 8.48 8.50 8.48 8.51
## 5 8.69 8.67 8.55 8.57 8.56 8.67
## 6 8.60 8.62 8.42 8.56 8.54 8.49
## 7 8.60 8.60 8.54 8.60 8.54 8.58
## 8 8.62 8.62 8.76 8.70 8.60 8.65
## 9 8.37 8.39 8.28 8.19 8.26 8.20
## 10 8.60 8.61 8.60 8.57 8.58 8.55
## # ℹ 190 more rows
## # ℹ 1 more variable: xgBoost <dbl>
library(writexl)
## Warning: package 'writexl' was built under R version 4.3.3
write_xlsx(TestTable,"C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\Finalpredictions.xlsx")
After cleaning this data set and building a variety of models on it, it seems that Random Forest and Boosted Trees are the best options to use on the evaluation set, since they performed really well in predicting PH values. In a business setting, if we look at the variable importance, we note that in the random forest Mnf Flow, Usage cont, and Brand Code C were important predictors of PH; the same predictors surfaced when constructing the boosted trees. Without more domain knowledge in hand, I would guess that the PH values were influenced by how Brand C measured their PH, and by how the other key predictors fed into the PH readings. Either way, I would tell stakeholders or the higher-ups that we should take careful consideration of how we handle Mnf Flow and Usage cont, and possibly learn more about how Brand Code C handles its PH measurements.

I am really disappointed with how my xgboost model turned out; I was hoping it would outperform the random forest and the boosted trees, but I wanted to keep the model in to showcase that not everything goes right in this sort of machine-learning workflow. Testing the models on the actual evaluation set, I noticed that if there were NA values in any of the columns, the models would skip producing PH values for those observations, hence why the Excel file contains only 200 predictions instead of the full 267. Ultimately, I would choose the Random Forest model because it performed really well on the held-out data and its R-squared and RMSE metrics are the best of the bunch.