This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
For this second project, we used a number of models available in the caret library. Since we are interested in trying different tuning values across the many models fit to this dataset, a simple 10-fold cross-validation was employed; the data set is fairly small, and cross-validation is a safe resampling choice here. We will choose the best model based on the resulting performance metrics.
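The knitted report does not echo its setup chunk; based on the functions called below, a minimal set of packages would be (an inferred list, not from the original):
## Assumed setup chunk, inferred from the functions used in this report
library(readxl)       # read_excel()
library(dplyr)        # mutate(), rename(), %>%
library(tidyr)        # gather()
library(purrr)        # keep()
library(ggplot2)      # plots
library(DT)           # datatable()
library(summarytools) # dfSummary()
library(corrplot)     # corrplot()
library(mice)         # imputation; loaded after tidyr so mice::complete() is the one used
library(caret)        # model training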
Training <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentData.xlsx")
Training <- Training |>
  mutate(`Brand Code` = factor(`Brand Code`))
#Replacing the spaces in the column name with '.' to prevent issues with the mice package.
colnames(Training) = gsub(" ",".",colnames(Training))
head(Training) |>
datatable()
dfSummary(Training,
style = "grid",
plain.ascii = FALSE,
headings=FALSE,
tmp.img.dir = "/tmp")
| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | Brand.Code [factor] | Levels: A, B, C, D | A: 293 (12.0%), B: 1239 (50.6%), C: 304 (12.4%), D: 615 (25.1%) | 2451 (95.3%) | 120 (4.7%) |
| 2 | Carb.Volume [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5 < 5.3 < 5.7; IQR (CV): 0.2 (0) | 101 distinct values | 2561 (99.6%) | 10 (0.4%) |
| 3 | Fill.Ounces [numeric] | Mean (sd): 24 (0.1); min < med < max: 23.6 < 24 < 24.3; IQR (CV): 0.1 (0) | 92 distinct values | 2533 (98.5%) | 38 (1.5%) |
| 4 | PC.Volume [numeric] | Mean (sd): 0.3 (0.1); min < med < max: 0.1 < 0.3 < 0.5; IQR (CV): 0.1 (0.2) | 454 distinct values | 2532 (98.5%) | 39 (1.5%) |
| 5 | Carb.Pressure [numeric] | Mean (sd): 68.2 (3.5); min < med < max: 57 < 68.2 < 79.4; IQR (CV): 5 (0.1) | 106 distinct values | 2544 (98.9%) | 27 (1.1%) |
| 6 | Carb.Temp [numeric] | Mean (sd): 141.1 (4); min < med < max: 128.6 < 140.8 < 154; IQR (CV): 5.4 (0) | 123 distinct values | 2545 (99.0%) | 26 (1.0%) |
| 7 | PSC [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0.1 < 0.3; IQR (CV): 0.1 (0.6) | 143 distinct values | 2538 (98.7%) | 33 (1.3%) |
| 8 | PSC.Fill [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0 < 0.2 < 0.6; IQR (CV): 0.2 (0.6) | 60 distinct values | 2548 (99.1%) | 23 (0.9%) |
| 9 | PSC.CO2 [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0 < 0.2; IQR (CV): 0.1 (0.8) | 23 distinct values | 2532 (98.5%) | 39 (1.5%) |
| 10 | Mnf.Flow [numeric] | Mean (sd): 24.6 (119.5); min < med < max: -100.2 < 65.2 < 229.4; IQR (CV): 240.8 (4.9) | 487 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 11 | Carb.Pressure1 [numeric] | Mean (sd): 122.6 (4.7); min < med < max: 105.6 < 123.2 < 140.2; IQR (CV): 6.4 (0) | 140 distinct values | 2539 (98.8%) | 32 (1.2%) |
| 12 | Fill.Pressure [numeric] | Mean (sd): 47.9 (3.2); min < med < max: 34.6 < 46.4 < 60.4; IQR (CV): 4 (0.1) | 108 distinct values | 2549 (99.1%) | 22 (0.9%) |
| 13 | Hyd.Pressure1 [numeric] | Mean (sd): 12.4 (12.4); min < med < max: -0.8 < 11.4 < 58; IQR (CV): 20.2 (1) | 245 distinct values | 2560 (99.6%) | 11 (0.4%) |
| 14 | Hyd.Pressure2 [numeric] | Mean (sd): 21 (16.4); min < med < max: 0 < 28.6 < 59.4; IQR (CV): 34.6 (0.8) | 207 distinct values | 2556 (99.4%) | 15 (0.6%) |
| 15 | Hyd.Pressure3 [numeric] | Mean (sd): 20.5 (16); min < med < max: -1.2 < 27.6 < 50; IQR (CV): 33.4 (0.8) | 192 distinct values | 2556 (99.4%) | 15 (0.6%) |
| 16 | Hyd.Pressure4 [numeric] | Mean (sd): 96.3 (13.1); min < med < max: 52 < 96 < 142; IQR (CV): 16 (0.1) | 40 distinct values | 2541 (98.8%) | 30 (1.2%) |
| 17 | Filler.Level [numeric] | Mean (sd): 109.3 (15.7); min < med < max: 55.8 < 118.4 < 161.2; IQR (CV): 21.7 (0.1) | 288 distinct values | 2551 (99.2%) | 20 (0.8%) |
| 18 | Filler.Speed [numeric] | Mean (sd): 3687.2 (770.8); min < med < max: 998 < 3982 < 4030; IQR (CV): 110 (0.2) | 244 distinct values | 2514 (97.8%) | 57 (2.2%) |
| 19 | Temperature [numeric] | Mean (sd): 66 (1.4); min < med < max: 63.6 < 65.6 < 76.2; IQR (CV): 1.2 (0) | 56 distinct values | 2557 (99.5%) | 14 (0.5%) |
| 20 | Usage.cont [numeric] | Mean (sd): 21 (3); min < med < max: 12.1 < 21.8 < 25.9; IQR (CV): 5.4 (0.1) | 481 distinct values | 2566 (99.8%) | 5 (0.2%) |
| 21 | Carb.Flow [numeric] | Mean (sd): 2468.4 (1073.7); min < med < max: 26 < 3028 < 5104; IQR (CV): 2042 (0.4) | 533 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 22 | Density [numeric] | Mean (sd): 1.2 (0.4); min < med < max: 0.2 < 1 < 1.9; IQR (CV): 0.7 (0.3) | 78 distinct values | 2570 (100.0%) | 1 (0.0%) |
| 23 | MFR [numeric] | Mean (sd): 704 (73.9); min < med < max: 31.4 < 724 < 868.6; IQR (CV): 24.7 (0.1) | 587 distinct values | 2359 (91.8%) | 212 (8.2%) |
| 24 | Balling [numeric] | Mean (sd): 2.2 (0.9); min < med < max: -0.2 < 1.6 < 4; IQR (CV): 1.8 (0.4) | 217 distinct values | 2570 (100.0%) | 1 (0.0%) |
| 25 | Pressure.Vacuum [numeric] | Mean (sd): -5.2 (0.6); min < med < max: -6.6 < -5.4 < -3.6; IQR (CV): 0.6 (-0.1) | 16 distinct values | 2571 (100.0%) | 0 (0.0%) |
| 26 | PH [numeric] | Mean (sd): 8.5 (0.2); min < med < max: 7.9 < 8.5 < 9.4; IQR (CV): 0.2 (0) | 52 distinct values | 2567 (99.8%) | 4 (0.2%) |
| 27 | Oxygen.Filler [numeric] | Mean (sd): 0 (0); min < med < max: 0 < 0 < 0.4; IQR (CV): 0 (1) | 338 distinct values | 2559 (99.5%) | 12 (0.5%) |
| 28 | Bowl.Setpoint [numeric] | Mean (sd): 109.3 (15.3); min < med < max: 70 < 120 < 140; IQR (CV): 20 (0.1) | 11 distinct values | 2569 (99.9%) | 2 (0.1%) |
| 29 | Pressure.Setpoint [numeric] | Mean (sd): 47.6 (2); min < med < max: 44 < 46 < 52; IQR (CV): 4 (0) | 8 distinct values | 2559 (99.5%) | 12 (0.5%) |
| 30 | Air.Pressurer [numeric] | Mean (sd): 142.8 (1.2); min < med < max: 140.8 < 142.6 < 148.2; IQR (CV): 0.8 (0) | 32 distinct values | 2571 (100.0%) | 0 (0.0%) |
| 31 | Alch.Rel [numeric] | Mean (sd): 6.9 (0.5); min < med < max: 5.3 < 6.6 < 8.6; IQR (CV): 0.7 (0.1) | 53 distinct values | 2562 (99.6%) | 9 (0.4%) |
| 32 | Carb.Rel [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5 < 5.4 < 6.1; IQR (CV): 0.2 (0) | 42 distinct values | 2561 (99.6%) | 10 (0.4%) |
| 33 | Balling.Lvl [numeric] | Mean (sd): 2.1 (0.9); min < med < max: 0 < 1.5 < 3.7; IQR (CV): 1.8 (0.4) | 82 distinct values | 2570 (100.0%) | 1 (0.0%) |
Some of the columns look skewed, others are sparse, and a few look roughly normal with slight skew. Hyd.Pressure1, 2, and 3 have a large number of zeros in their distributions, which is worrisome: those columns are quite sparse. Most records are Brand Code B, followed by D, with A and C in roughly equal proportion.
Training %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplot of Numerical Variables")
## Imputation

There are missing values and some outliers that will be handled through imputation using the mice package. We will also remove any predictors with near-zero variance using the nearZeroVar function from the caret library.
set.seed(1234)
Training <- Training |>
mice(m=1,print=F) |>
complete()
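## Optional aside (not part of the original run): list the near-zero-variance
## predictors by name before the next line drops them by column position.
nearZeroVar(Training, names = TRUE)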
Training <- Training[, -nearZeroVar(Training)]
Let’s take a look at the correlation matrix and see how the predictors relate to one another and to PH.
## The correlation matrix requires complete data and numeric predictors.
Training2 <- Training %>%
keep(is.numeric)
corrplot(cor(Training2),method = "circle",type="lower")
Looking at the correlation plot, we can see that some predictors are strongly correlated with one another; for example, Carb.Volume is correlated with Density, Balling, Alch.Rel, and others. Interestingly, PH (our response variable) shows no strong correlation with any single predictor, only weak and largely negative ones.
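As a quick check (an aside, not in the original output), we can rank the predictors by their correlation with PH to confirm that none is strongly related:
## Aside: rank predictors by correlation with PH (data are complete after imputation)
cor(Training2)[, "PH"] |>
  sort(decreasing = TRUE) |>
  round(2)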
Since the Brand Code column is essentially a character column, we one-hot encoded it: this preserves as much of the variable’s information as possible, and most ML algorithms require numeric columns.
# Create the dummy variables; dummyVars() excludes the response (PH) from the encoded matrix
library(caret)
dummy_mod <- dummyVars(PH ~ ., data = Training)
Training3Dat_Mat <- predict(dummy_mod, newdata = Training)
## Convert to a data frame; this will become the final training set
TrainingThree <- data.frame(Training3Dat_Mat)
# Re-attach PH so we can create the training and testing splits
TrainingThree <- cbind(TrainingThree, Training$PH)
## Rename the appended `Training$PH` column to just PH
TrainingThree <- TrainingThree %>%
  rename(PH = `Training$PH`)
Here we will create, train, and tune a set of candidate models on these data.
library(caret)
## Fit a PLS model
set.seed(15)
# Use a simple 10-fold cross-validation for resampling, and create the training/testing split
ctrl <- trainControl(method = "cv", number = 10)
train_indices <- createDataPartition(TrainingThree$PH, p = 0.8, list = FALSE)
df_train <- TrainingThree[train_indices, ]
df_test <- TrainingThree[-train_indices, ]
set.seed(100)
plsTune <- train(PH ~ ., data = df_train,
                 method = "pls", tuneLength = 30,
                 trControl = ctrl, preProc = c("center", "scale"))
plsTune$results |> datatable()
ggplot(plsTune)
The cross-validated performance converges: components beyond the first dozen or so add almost no predictive value.
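To confirm which value caret actually selected (a quick aside, not in the original output):
## Which number of components caret selected (lowest cross-validated RMSE)
plsTune$bestTune
plsTune$results[which.min(plsTune$results$RMSE), c("ncomp", "RMSE", "Rsquared")]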
summary(plsTune)
## Data: X dimension: 2058 34
## Y dimension: 2058 1
## Fit method: oscorespls
## Number of components considered: 28
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 15.71 27.6 41.08 48.00 52.53 55.35 58.30
## .outcome 25.74 33.6 37.19 39.12 40.69 41.67 42.35
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 61.06 63.46 66.32 68.97 71.23 73.39 76.23
## .outcome 42.81 43.15 43.29 43.37 43.43 43.47 43.50
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
## X 77.72 79.24 80.84 82.02 84.47 86.71 88.56
## .outcome 43.57 43.60 43.63 43.68 43.70 43.71 43.71
## 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps
## X 90.25 91.69 92.72 94.44 95.55 97.38 98.42
## .outcome 43.72 43.72 43.72 43.72 43.72 43.72 43.72
plsPred <- predict(plsTune,df_test)
postResample(plsPred,df_test$PH)
## RMSE Rsquared MAE
## 0.1390888 0.3787972 0.1072777
The PLS model gives a moderate fit on the test set: the RMSE is fairly low (0.139), but the R-squared (0.38) leaves room for improvement.
Let’s now fit a principal component regression (PCR) model with the caret package.
set.seed(100)
pcrTune <- train(PH ~ ., data = df_train,
                 method = "pcr", tuneLength = 30,
                 trControl = ctrl, preProc = c("center", "scale"))
pcrTune$results |> datatable()
ggplot(pcrTune)
It seems the lowest RMSE was at the 27th component.
pcrPred <- predict(pcrTune,df_test)
postResample(pcrPred,df_test$PH)
## RMSE Rsquared MAE
## 0.1389831 0.3794477 0.1075790
PCR is essentially indistinguishable from PLS here; on the test set it actually edges out PLS by a hair (RMSE 0.13898 vs. 0.13909).
Let’s incorporate neural networks and MARS in this section.

Neural networks tend to overfit the relationship between the predictors and the response because of their large number of coefficients; this can be mitigated with early stopping and with weight decay. The decay parameter lambda should be between 0 and 0.1, and the predictors should be centered and scaled. We will also use model averaging (avNNet), since the textbook notes it tends to give better results.
## Find highly correlated columns and remove them from the data
tooCorrelated <- findCorrelation(cor(df_train), cutoff = 0.75)
trainXnet <- df_train[, -tooCorrelated]
testXnet <- df_test[, -tooCorrelated]
Following the textbook, I removed highly correlated predictors from the training set before feeding it to the neural network.
## Create the candidate set of models to evaluate
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = c(1:5), .bag = FALSE)
## Train the averaged neural network; the tuning grid is kept small because training takes a while
set.seed(100)
library(doParallel)
# Register a parallel backend so the resampling loops inside train() run in parallel
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)
nnetModel <- train(PH ~ ., data = trainXnet,
                   method = "avNNet", tuneGrid = nnetGrid,
                   trControl = ctrl, preProc = c("center", "scale"),
                   linout = TRUE, trace = FALSE,
                   MaxNWts = 5 * (ncol(trainXnet) + 1) + 5 + 1,
                   maxit = 250)
stopCluster(cl)
nnetModel$results |> datatable()
ggplot(nnetModel)
Looking at the plot, the weight-decay settings track each other closely, with decay = 0.01 winning at five hidden units.
nnetPred <- predict(nnetModel,testXnet)
postResample(nnetPred,testXnet$PH)
## RMSE Rsquared MAE
## 0.12351504 0.51176183 0.09473186
With this tuning, the neural network performs well and beats both the PLS and PCR models (test RMSE 0.124, R-squared 0.51).
Let’s now build a MARS model. It has two tuning parameters: the degree of the features added to the model and the number of retained terms (nprune); the latter can be chosen by the default pruning procedure, generalized cross-validation (GCV), as described in the textbook (model summary on page 149).
# Use the same 10-fold CV and try nprune = 2:30 to see how many terms get retained
set.seed(100)
Marsgrid <- expand.grid(degree = 1:3, nprune = 2:30)
Marsmodel <- train(PH ~ ., data = df_train,
                   method = "earth", tuneGrid = Marsgrid, trControl = ctrl)
ggplot(Marsmodel)
Higher degrees and larger nprune values capture most of the structure and produce the lowest RMSE values.
Marspred <- predict(Marsmodel,df_test)
postResample(Marspred,df_test$PH)
## RMSE Rsquared MAE
## 0.12329640 0.51203951 0.09219793
The MARS model essentially ties the neural network, and in fact edges it out on every test metric (RMSE 0.1233 vs. 0.1235, with slightly better R-squared and MAE as well).
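To see what the final MARS model actually kept, one can inspect the retained hinge terms and their importance (a quick aside; output not shown in the original):
## Inspect the retained terms of the final MARS model and their GCV statistics
summary(Marsmodel$finalModel)
## Variable importance for the train() wrapper
varImp(Marsmodel)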
Finally, let’s build tree ensembles: a random forest and boosted trees.

Let’s start with the random forest; here the main tuning parameter is mtry, the number of randomly selected predictors considered at each split.
## Breiman suggests tuning mtry to about 1/3 of the number of predictors; we evaluate five candidate values (pg 200)
library(randomForest)
set.seed(100)
rfGrid <- expand.grid(mtry = c(2, 4, 6, 8, 10))
RandomF <- train(PH ~ ., data = df_train,
                 method = "rf", trControl = ctrl, tuneGrid = rfGrid)
RandomF$results |> datatable()
ggplot(RandomF)
Using between eight and ten predictors per split produces a slight further reduction in RMSE.
ggplot(varImp(RandomF,scale = TRUE))
Mnf.Flow, Usage.cont, and Brand.Code C were the most important predictors in the random forest model.
RFpred <- predict(RandomF,df_test)
postResample(RFpred,df_test$PH)
## RMSE Rsquared MAE
## 0.10202434 0.69200305 0.07291014
Not surprisingly, the random forest performs very well on the test data; I assumed boosted trees and xgboost would do even better on this kind of data.
Let’s try to create boosted trees now.
## Tune interaction depths of 1 to 7, 100 to 1000 trees in steps of 100, and the two
## shrinkage values noted in the textbook. n.minobsinnode is the minimum number of
## observations in a terminal node. Note this call keeps train()'s default bootstrap resampling.
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 100),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)
set.seed(100)
gbmModel <- train(PH ~ ., data = df_train,
                  method = "gbm", tuneGrid = gbmGrid, verbose = FALSE)
gbmModel$results |> datatable()
ggplot(gbmModel)
The higher shrinkage value (0.1) gives lower RMSE, and the largest tree depth tried (seven) gives the lowest values of all, though deeper trees can also overfit the training data.
summary(gbmModel,plot = FALSE)|> datatable()
Usage.cont and Mnf.Flow had the strongest influence on the boosted trees, followed by Brand.Code C and Oxygen.Filler.
GBMpred <- predict(gbmModel,df_test)
postResample(GBMpred,df_test$PH)
## RMSE Rsquared MAE
## 0.11067048 0.60673144 0.08022185
The boosted model performs slightly worse than the random forest (test RMSE 0.111 vs. 0.102) but still does well on the testing data.
Finally, I want to try xgboost, a popular gradient-boosting implementation; given its reputation, I expected this technique to work best here.
## Source for tuning the xgboost model:
## https://www.youtube.com/watch?v=lwv8GneTUOk&ab_channel=DataGarden
## The referenced guide suggests gamma = 5 as a starting point if you are unsure how to tune it
xgBgrid <- expand.grid(nrounds = 100, eta = c(0.1, 0.3), max_depth = c(2, 4, 6),
                       colsample_bytree = c(0.5, 0.9), subsample = c(0.5, 0.8),
                       gamma = 5, min_child_weight = c(0, 10))
xgBoostModel <- train(PH ~ ., data = df_train,
                      method = "xgbTree", objective = "reg:squarederror",
                      trControl = ctrl, tuneGrid = xgBgrid)
ggplot(varImp(xgBoostModel))
xgbPred <- predict(xgBoostModel,df_test)
postResample(xgbPred,df_test$PH)
## RMSE Rsquared MAE
## 0.1622689 0.2303943 0.1275589
xgboost did not do well on this data set (test RMSE 0.162, R-squared 0.23). One plausible explanation is the fixed gamma = 5: gamma is the minimum loss reduction required to make a split, and with PH varying over only about 7.9 to 9.4 (sd roughly 0.17), few candidate splits can reduce the squared-error loss that much, so the trees stay too shallow and the model underfits. I will keep this failed model here for transparency.
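If the gamma hypothesis is right, a follow-up run with the penalty relaxed should recover performance. A sketch (the relaxed grid below is my own choice, not from the referenced video, and was not run for this report):
## Hypothetical follow-up: relax the split penalty (gamma) and retune
xgBgrid2 <- expand.grid(nrounds = c(100, 500), eta = c(0.05, 0.1, 0.3),
                        max_depth = c(2, 4, 6), colsample_bytree = 0.9,
                        subsample = 0.8, gamma = c(0, 1), min_child_weight = c(1, 10))
set.seed(100)
xgBoostModel2 <- train(PH ~ ., data = df_train,
                       method = "xgbTree", objective = "reg:squarederror",
                       trControl = ctrl, tuneGrid = xgBgrid2)
postResample(predict(xgBoostModel2, df_test), df_test$PH)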
Table <- bind_rows(pls = postResample(plsPred, df_test$PH),
                   pcr = postResample(pcrPred, df_test$PH),
                   NN = postResample(nnetPred, df_test$PH),
                   Mars = postResample(Marspred, df_test$PH),
                   RF = postResample(RFpred, df_test$PH),
                   Boost = postResample(GBMpred, df_test$PH),
                   xgb = postResample(xgbPred, df_test$PH))
Table$id <- c("PLS", "PCR", "Neural Network", "MARS",
              "Random Forest", "Boosted Trees", "xgBoost")
Table |> datatable()
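For readability, the same table can be sorted by test RMSE (a small aside; lower is better):
## Rank the models by test-set RMSE
Table |>
  arrange(RMSE) |>
  select(id, RMSE, Rsquared, MAE)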
Comparing the metrics, Random Forest and Boosted Trees performed best on the test data. xgBoost, despite being the algorithm popular among many data scientists, performed worst here.
I will now apply the models to the evaluation set from the other Excel file and export the predictions to an Excel file. But first, we have to read and clean the data.
Test <- read_excel("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 624\\StudentEvaluation.xlsx")
Test <- Test |>
  mutate(`Brand Code` = factor(`Brand Code`))
#Replacing the spaces in the column name with '.' to prevent issues with the mice package.
colnames(Test) = gsub(" ",".",colnames(Test))
Test |> datatable()
Now I have to perform the same data-cleaning and preprocessing steps on the evaluation set.
dfSummary(Test,
style = "grid",
plain.ascii = FALSE,
headings=FALSE,
tmp.img.dir = "/tmp")
| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | Brand.Code [factor] | Levels: A, B, C, D | A: 35 (13.5%), B: 129 (49.8%), C: 31 (12.0%), D: 64 (24.7%) | 259 (97.0%) | 8 (3.0%) |
| 2 | Carb.Volume [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5.1 < 5.3 < 5.7; IQR (CV): 0.2 (0) | 71 distinct values | 266 (99.6%) | 1 (0.4%) |
| 3 | Fill.Ounces [numeric] | Mean (sd): 24 (0.1); min < med < max: 23.7 < 24 < 24.2; IQR (CV): 0.1 (0) | 52 distinct values | 261 (97.8%) | 6 (2.2%) |
| 4 | PC.Volume [numeric] | Mean (sd): 0.3 (0.1); min < med < max: 0.1 < 0.3 < 0.5; IQR (CV): 0.1 (0.2) | 183 distinct values | 263 (98.5%) | 4 (1.5%) |
| 5 | Carb.Pressure [numeric] | Mean (sd): 68.3 (3.9); min < med < max: 60.2 < 68 < 77.6; IQR (CV): 5.3 (0.1) | 76 distinct values | 267 (100.0%) | 0 (0.0%) |
| 6 | Carb.Temp [numeric] | Mean (sd): 141.2 (4.3); min < med < max: 130 < 140.8 < 154; IQR (CV): 5.4 (0) | 88 distinct values | 266 (99.6%) | 1 (0.4%) |
| 7 | PSC [numeric] | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0.1 < 0.2; IQR (CV): 0.1 (0.6) | 98 distinct values | 262 (98.1%) | 5 (1.9%) |
| 8 | PSC.Fill [numeric] | Mean (sd): 0.2 (0.1); min < med < max: 0 < 0.2 < 0.6; IQR (CV): 0.2 (0.6) | 48 distinct values | 264 (98.9%) | 3 (1.1%) |
| 9 | PSC.CO2 [numeric] | Mean (sd): 0.1 (0); min < med < max: 0 < 0 < 0.2; IQR (CV): 0 (0.8) | 18 distinct values | 262 (98.1%) | 5 (1.9%) |
| 10 | Mnf.Flow [numeric] | Mean (sd): 21 (117.8); min < med < max: -100.2 < 0.2 < 220.4; IQR (CV): 241.3 (5.6) | 114 distinct values | 267 (100.0%) | 0 (0.0%) |
| 11 | Carb.Pressure1 [numeric] | Mean (sd): 123 (4.4); min < med < max: 113 < 123.4 < 136; IQR (CV): 5.3 (0) | 87 distinct values | 263 (98.5%) | 4 (1.5%) |
| 12 | Fill.Pressure [numeric] | Mean (sd): 48.1 (3.4); min < med < max: 37.8 < 47.8 < 60.2; IQR (CV): 4.2 (0.1) | 61 distinct values | 265 (99.3%) | 2 (0.7%) |
| 13 | Hyd.Pressure1 [numeric] | Mean (sd): 12 (13.5); min < med < max: -50 < 10.4 < 50; IQR (CV): 20.4 (1.1) | 115 distinct values | 267 (100.0%) | 0 (0.0%) |
| 14 | Hyd.Pressure2 [numeric] | Mean (sd): 20.1 (17.2); min < med < max: -50 < 26.8 < 61.4; IQR (CV): 34.8 (0.9) | 95 distinct values | 266 (99.6%) | 1 (0.4%) |
| 15 | Hyd.Pressure3 [numeric] | Mean (sd): 19.6 (16.6); min < med < max: -50 < 27.7 < 49.2; IQR (CV): 33 (0.8) | 89 distinct values | 266 (99.6%) | 1 (0.4%) |
| 16 | Hyd.Pressure4 [numeric] | Mean (sd): 97.8 (13.9); min < med < max: 68 < 98 < 140; IQR (CV): 14 (0.1) | 34 distinct values | 263 (98.5%) | 4 (1.5%) |
| 17 | Filler.Level [numeric] | Mean (sd): 110.3 (15.5); min < med < max: 69.2 < 118.6 < 153.2; IQR (CV): 19.6 (0.1) | 107 distinct values | 265 (99.3%) | 2 (0.7%) |
| 18 | Filler.Speed [numeric] | Mean (sd): 3581.4 (911.2); min < med < max: 1006 < 3978 < 4020; IQR (CV): 184 (0.3) | 82 distinct values | 257 (96.3%) | 10 (3.7%) |
| 19 | Temperature [numeric] | Mean (sd): 66.2 (1.7); min < med < max: 63.8 < 65.8 < 75.4; IQR (CV): 1.2 (0) | 36 distinct values | 265 (99.3%) | 2 (0.7%) |
| 20 | Usage.cont [numeric] | Mean (sd): 20.9 (3); min < med < max: 12.9 < 21.4 < 24.6; IQR (CV): 5.6 (0.1) | 174 distinct values | 265 (99.3%) | 2 (0.7%) |
| 21 | Carb.Flow [numeric] | Mean (sd): 2408.6 (1161.4); min < med < max: 0 < 3038 < 3858; IQR (CV): 2132 (0.5) | 178 distinct values | 267 (100.0%) | 0 (0.0%) |
| 22 | Density [numeric] | Mean (sd): 1.2 (0.4); min < med < max: 0.1 < 1 < 1.8; IQR (CV): 0.7 (0.3) | 53 distinct values | 266 (99.6%) | 1 (0.4%) |
| 23 | MFR [numeric] | Mean (sd): 697.8 (96.4); min < med < max: 15.6 < 724.6 < 784.8; IQR (CV): 24.4 (0.1) | 162 distinct values | 236 (88.4%) | 31 (11.6%) |
| 24 | Balling [numeric] | Mean (sd): 2.2 (0.9); min < med < max: 0.9 < 1.6 < 3.8; IQR (CV): 1.7 (0.4) | 82 distinct values | 266 (99.6%) | 1 (0.4%) |
| 25 | Pressure.Vacuum [numeric] | Mean (sd): -5.2 (0.6); min < med < max: -6.4 < -5.2 < -3.6; IQR (CV): 0.8 (-0.1) | 15 distinct values | 266 (99.6%) | 1 (0.4%) |
| 26 | PH [logical] | All NA's | — | 0 (0.0%) | 267 (100.0%) |
| 27 | Oxygen.Filler [numeric] | Mean (sd): 0 (0); min < med < max: 0 < 0 < 0.4; IQR (CV): 0 (1.1) | 149 distinct values | 264 (98.9%) | 3 (1.1%) |
| 28 | Bowl.Setpoint [numeric] | Mean (sd): 109.6 (15); min < med < max: 70 < 120 < 130; IQR (CV): 20 (0.1) | 70: 9 (3.4%), 80: 15 (5.6%), 90: 35 (13.2%), 100: 12 (4.5%), 110: 49 (18.4%), 120: 139 (52.3%), 130: 7 (2.6%) | 266 (99.6%) | 1 (0.4%) |
| 29 | Pressure.Setpoint [numeric] | Mean (sd): 47.7 (2.1); min < med < max: 44 < 46 < 52; IQR (CV): 4 (0) | 6 distinct values | 265 (99.3%) | 2 (0.7%) |
| 30 | Air.Pressurer [numeric] | Mean (sd): 142.8 (1.2); min < med < max: 141.2 < 142.6 < 147.2; IQR (CV): 0.6 (0) | 24 distinct values | 266 (99.6%) | 1 (0.4%) |
| 31 | Alch.Rel [numeric] | Mean (sd): 6.9 (0.5); min < med < max: 6.4 < 6.6 < 7.8; IQR (CV): 0.6 (0.1) | 31 distinct values | 264 (98.9%) | 3 (1.1%) |
| 32 | Carb.Rel [numeric] | Mean (sd): 5.4 (0.1); min < med < max: 5.2 < 5.4 < 5.7; IQR (CV): 0.2 (0) | 28 distinct values | 265 (99.3%) | 2 (0.7%) |
| 33 | Balling.Lvl [numeric] | Mean (sd): 2.1 (0.9); min < med < max: 0 < 1.5 < 3.4; IQR (CV): 1.7 (0.4) | 53 distinct values | 267 (100.0%) | 0 (0.0%) |
Glancing at the evaluation data, it more or less mirrors the structure of the training set, so I will clean and preprocess it in the same way before scoring it with the trained models.
# Create the dummy variables; PH is empty in the evaluation set and is excluded by dummyVars()
library(caret)
dummy_mod1 <- dummyVars(PH ~ ., data = Test)
Test2Dat_Mat <- predict(dummy_mod1, newdata = Test)
## Convert to a data frame and drop rows with missing predictor values;
## this is the final evaluation set
TestTwo <- data.frame(Test2Dat_Mat) |>
  na.omit()
I first tried predicting with the NA values left in, but the models only return results for rows with complete predictor data; I therefore dropped incomplete rows with na.omit(), so only complete rows receive a prediction.
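An alternative (a sketch, not run for this report) would be to impute the evaluation predictors with mice, as we did for training, so that all 267 rows receive predictions:
## Hypothetical alternative: impute the evaluation predictors instead of dropping rows.
## PH is all NA by design, so exclude it before imputation, then reuse dummy_mod1.
set.seed(1234)
TestImputed <- Test |>
  select(-PH) |>
  mice(m = 1, print = FALSE) |>
  complete()
TestImputed$PH <- NA   # restore the placeholder response column for dummyVars()
TestTwoFull <- data.frame(predict(dummy_mod1, newdata = TestImputed))
nrow(TestTwoFull)      # should now be 267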
PlsTest <- predict(plsTune,TestTwo)
pcRTest <- predict(pcrTune,TestTwo)
NeuralTest <- predict(nnetModel,TestTwo)
MarsTest <- predict(Marsmodel,TestTwo)
RFTest <- predict(RandomF,TestTwo)
BoostTest <- predict(gbmModel,TestTwo)
xgBoostTest <- predict(xgBoostModel,TestTwo)
TestTable <- bind_cols("PLS" = PlsTest, "PCR" = pcRTest,
                       "Neural Network" = NeuralTest, "MARS" = MarsTest,
                       "Random Forest" = RFTest, "Boosted Tree" = BoostTest,
                       "xgBoost" = xgBoostTest)
TestTable
## # A tibble: 200 × 7
## PLS PCR `Neural Network` MARS[,"y"] `Random Forest` `Boosted Tree`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8.60 8.59 8.47 8.47 8.52 8.47
## 2 8.60 8.60 8.51 8.54 8.51 8.54
## 3 8.52 8.52 8.52 8.48 8.49 8.40
## 4 8.58 8.57 8.48 8.53 8.51 8.49
## 5 8.69 8.67 8.52 8.62 8.56 8.58
## 6 8.59 8.60 8.51 8.51 8.53 8.56
## 7 8.59 8.59 8.52 8.55 8.48 8.53
## 8 8.64 8.63 8.74 8.71 8.61 8.63
## 9 8.42 8.43 8.38 8.34 8.31 8.28
## 10 8.62 8.63 8.65 8.59 8.59 8.52
## # ℹ 190 more rows
## # ℹ 1 more variable: xgBoost <dbl>
library(writexl)
write_xlsx(TestTable,"Finalpredictions.xlsx")
After cleaning the data and fitting a variety of models, Random Forest and Boosted Trees are the best options for scoring the evaluation set, since they clearly outperformed the others in predicting PH. On the business side, the variable-importance rankings of both tree ensembles agree: Mnf.Flow, Usage.cont, and Brand.Code C are the key predictors of PH. Without more domain knowledge I can only speculate that PH is partly driven by how Brand C product is made or measured, along with those two process variables. Either way, I would tell stakeholders that Mnf.Flow and Usage.cont deserve careful monitoring, and that we should learn more about how Brand Code C's PH values arise.

I am disappointed with how the xgboost model turned out; I had hoped it would outperform the random forest and the boosted trees, but I kept it in to show that not everything goes right in this kind of machine-learning workflow.

When scoring the evaluation set, I noticed that rows with an NA in any predictor column receive no prediction, which is why the Excel file contains 200 predictions instead of the full 267.

Ultimately, I would choose the Random Forest model: it posted the best RMSE and R-squared on our held-out test data.