Instructions Project #2 (Team) Assignment

This is role playing. I am your new boss. I am in charge of production at ABC Beverage, and you are a team of data scientists reporting to me. My leadership has told me that new regulations require us to understand our manufacturing process and its predictive factors, and to report to them our predictive model of PH.

Please use the historical data set I am providing. Build and report the factors in BOTH a technical and a non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business-friendly, readable document and your predictions in an Excel-readable format. The technical report should show clearly the models you tested and how you selected your final approach.

Please submit both RPubs links and .rmd files (or other readable formats) for the technical and non-technical reports. Also submit the Excel file showing your models' predictions of pH.

Load Libraries

rm(list = ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble     1.1.5     ✔ feasts      0.4.1
## ✔ tsibbledata 0.4.1     ✔ fable       0.4.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:fabletools':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(AppliedPredictiveModeling)
library(e1071)
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:fabletools':
## 
##     interpolate
library(caTools)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(mlbench)
library(ipred)
library(class)
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## Registered S3 method overwritten by 'inum':
##   method          from   
##   format.interval tsibble
library(rpart)
library(rpart.plot)
library(readxl)
library(corrplot)
## corrplot 0.95 loaded
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loading required package: iterators
## Loading required package: parallel
library(writexl)
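
Note that doParallel is loaded above but no backend is registered, so model tuning later runs sequentially (caret emits a %dopar% warning in the neural network section). A backend could be registered as in the sketch below; the core count is machine-dependent, and the cluster should be stopped once training is done.

# Optionally register a parallel backend so caret can tune models in parallel
cl <- makePSOCKcluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
# ... fit models ...
# stopCluster(cl)  # release the workers when finished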

Read Datasets

# Define the URL of the raw Excel training file on GitHub
file_url <- "https://raw.githubusercontent.com/Naik-Khyati/data_624/main/p2/StudentData.xlsx"
# Download the file to a temporary location
temp_file <- tempfile(fileext = ".xlsx")
download.file(file_url, destfile = temp_file, mode = "wb")
# Read the Excel file
train <- read_excel(temp_file)
# View the data
print(train)
## # A tibble: 2,571 × 33
##    `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 B                     5.34          24.0       0.263            68.2
##  2 A                     5.43          24.0       0.239            68.4
##  3 B                     5.29          24.1       0.263            70.8
##  4 A                     5.44          24.0       0.293            63  
##  5 A                     5.49          24.3       0.111            67.2
##  6 A                     5.38          23.9       0.269            66.6
##  7 A                     5.31          23.9       0.268            64.2
##  8 B                     5.32          24.2       0.221            67.6
##  9 B                     5.25          24.0       0.263            64.2
## 10 B                     5.27          24.0       0.231            72  
## # ℹ 2,561 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
# Clean up the temporary file
unlink(temp_file)

There are 2571 records and 33 columns in the student data (training data).
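
A quick structural check confirms those dimensions:

# Confirm the reported dimensions: 2,571 rows and 33 columns
dim(train)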

# Define the URL of the raw Excel evaluation file on GitHub
file_url <- "https://raw.githubusercontent.com/Naik-Khyati/data_624/main/p2/StudentEvaluation.xlsx"
# Download the file to a temporary location
temp_file <- tempfile(fileext = ".xlsx")
download.file(file_url, destfile = temp_file, mode = "wb")
# Read the Excel file
test <- read_excel(temp_file)
# View the data
print(test)
## # A tibble: 267 × 33
##    `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
##    <chr>                <dbl>         <dbl>       <dbl>           <dbl>
##  1 D                     5.48          24.0       0.27             65.4
##  2 A                     5.39          24.0       0.227            63.2
##  3 B                     5.29          23.9       0.303            66.4
##  4 B                     5.27          23.9       0.186            64.8
##  5 B                     5.41          24.2       0.16             69.4
##  6 B                     5.29          24.1       0.212            73.4
##  7 A                     5.48          23.9       0.243            65.2
##  8 B                     5.42          24.1       0.123            67.4
##  9 A                     5.41          23.9       0.333            66.8
## 10 D                     5.47          24.0       0.256            72.6
## # ℹ 257 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## #   `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## #   `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## #   `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## #   `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## #   `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
# Clean up the temporary file
unlink(temp_file)

There are 267 records and 33 columns in the student evaluation data (test data).

Get Missing Value Counts

# Formatting Names to be more readable
colnames(train) <- sub(" ", "_", colnames(train))
paste("There are:", sum(is.na(train)), "total missing values in training data")
## [1] "There are: 844 total missing values in training data"
print(colSums(is.na(train)))
##        Brand_Code       Carb_Volume       Fill_Ounces         PC_Volume 
##               120                10                38                39 
##     Carb_Pressure         Carb_Temp               PSC          PSC_Fill 
##                27                26                33                23 
##           PSC_CO2          Mnf_Flow    Carb_Pressure1     Fill_Pressure 
##                39                 2                32                22 
##     Hyd_Pressure1     Hyd_Pressure2     Hyd_Pressure3     Hyd_Pressure4 
##                11                15                15                30 
##      Filler_Level      Filler_Speed       Temperature        Usage_cont 
##                20                57                14                 5 
##         Carb_Flow           Density               MFR           Balling 
##                 2                 1               212                 1 
##   Pressure_Vacuum                PH     Oxygen_Filler     Bowl_Setpoint 
##                 0                 4                12                 2 
## Pressure_Setpoint     Air_Pressurer          Alch_Rel          Carb_Rel 
##                12                 0                 9                10 
##       Balling_Lvl 
##                 1
colnames(test) <- sub(" ", "_", colnames(test))
paste("There are:", sum(is.na(test)), "total missing values in test data")
## [1] "There are: 374 total missing values in test data"
print(colSums(is.na(test)))
##        Brand_Code       Carb_Volume       Fill_Ounces         PC_Volume 
##                 8                 1                 6                 4 
##     Carb_Pressure         Carb_Temp               PSC          PSC_Fill 
##                 0                 1                 5                 3 
##           PSC_CO2          Mnf_Flow    Carb_Pressure1     Fill_Pressure 
##                 5                 0                 4                 2 
##     Hyd_Pressure1     Hyd_Pressure2     Hyd_Pressure3     Hyd_Pressure4 
##                 0                 1                 1                 4 
##      Filler_Level      Filler_Speed       Temperature        Usage_cont 
##                 2                10                 2                 2 
##         Carb_Flow           Density               MFR           Balling 
##                 0                 1                31                 1 
##   Pressure_Vacuum                PH     Oxygen_Filler     Bowl_Setpoint 
##                 1               267                 3                 1 
## Pressure_Setpoint     Air_Pressurer          Alch_Rel          Carb_Rel 
##                 2                 1                 3                 2 
##       Balling_Lvl 
##                 0

In the training data set, MFR has the most missing values (212 records), followed by Brand_Code (120). In total there are 844 missing values in the training data.
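
Raw counts can hide how concentrated the missingness is; a per-column percentage view shows that no column is missing more than roughly 8% of its values:

# Share of missing values per column (training data), highest first
miss_pct <- sort(colMeans(is.na(train)) * 100, decreasing = TRUE)
round(head(miss_pct, 5), 2)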

Exploratory Data Analysis

The code below computes summary statistics for each numeric variable:

reshape_data <- train %>%
  pivot_longer(!Brand_Code, names_to = "vars", values_to = "value")

summary_stats <- as.data.frame(reshape_data) %>%
  group_by(vars) %>%
  summarise(min    = min(value, na.rm = TRUE),
            max    = max(value, na.rm = TRUE),
            mean   = mean(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            std    = sd(value, na.rm = TRUE),
            var    = var(value, na.rm = TRUE))

print(summary_stats, n = 32)
## # A tibble: 32 × 7
##    vars                    min      max      mean    median       std        var
##    <chr>                 <dbl>    <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Air_Pressurer      141.      148.     143.      143.        1.21      1.47e+0
##  2 Alch_Rel             5.28      8.62     6.90      6.56      0.505     2.55e-1
##  3 Balling             -0.17      4.01     2.20      1.65      0.931     8.67e-1
##  4 Balling_Lvl          0         3.66     2.05      1.48      0.870     7.57e-1
##  5 Bowl_Setpoint       70       140      109.      120        15.3       2.34e+2
##  6 Carb_Flow           26      5104     2468.     3028      1074.        1.15e+6
##  7 Carb_Pressure       57        79.4     68.2      68.2       3.54      1.25e+1
##  8 Carb_Pressure1     106.      140.     123.      123.        4.74      2.25e+1
##  9 Carb_Rel             4.96      6.06     5.44      5.4       0.129     1.66e-2
## 10 Carb_Temp          129.      154      141.      141.        4.04      1.63e+1
## 11 Carb_Volume          5.04      5.7      5.37      5.35      0.106     1.13e-2
## 12 Density              0.24      1.92     1.17      0.98      0.378     1.43e-1
## 13 Fill_Ounces         23.6      24.3     24.0      24.0       0.0875    7.66e-3
## 14 Fill_Pressure       34.6      60.4     47.9      46.4       3.18      1.01e+1
## 15 Filler_Level        55.8     161.     109.      118.       15.7       2.46e+2
## 16 Filler_Speed       998      4030     3687.     3982       771.        5.94e+5
## 17 Hyd_Pressure1       -0.8      58       12.4      11.4      12.4       1.55e+2
## 18 Hyd_Pressure2        0        59.4     21.0      28.6      16.4       2.69e+2
## 19 Hyd_Pressure3       -1.2      50       20.5      27.6      16.0       2.55e+2
## 20 Hyd_Pressure4       52       142       96.3      96        13.1       1.72e+2
## 21 MFR                 31.4     869.     704.      724        73.9       5.46e+3
## 22 Mnf_Flow          -100.      229.      24.6      65.2     119.        1.43e+4
## 23 Oxygen_Filler        0.0024    0.4      0.0468    0.0334    0.0466    2.18e-3
## 24 PC_Volume            0.0793    0.478    0.277     0.271     0.0607    3.68e-3
## 25 PH                   7.88      9.36     8.55      8.54      0.173     2.98e-2
## 26 PSC                  0.002     0.27     0.0846    0.076     0.0493    2.43e-3
## 27 PSC_CO2              0         0.24     0.0564    0.04      0.0430    1.85e-3
## 28 PSC_Fill             0         0.62     0.195     0.18      0.118     1.39e-2
## 29 Pressure_Setpoint   44        52       47.6      46         2.04      4.16e+0
## 30 Pressure_Vacuum     -6.6      -3.6     -5.22     -5.4       0.570     3.25e-1
## 31 Temperature         63.6      76.2     66.0      65.6       1.38      1.91e+0
## 32 Usage_cont          12.1      25.9     21.0      21.8       2.98      8.87e+0
# Generate density plot for each variable in the reshaped data
ggplot(reshape_data, aes(x = value)) + 
  geom_density(fill = "skyblue", alpha = 0.5) +
  facet_wrap(~ vars, scales = "free") +  # Creates a separate plot for each variable
  theme_minimal() + 
  labs(title = "Density Plots for Observed Values by Variable",
       x = "Observed Value",
       y = "Density")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_density()`).

# Select numeric predictor columns from the train dataframe
cols <- train %>%
  select(-c('Brand_Code', 'PH')) %>%   # Remove non-numeric and target variables
  colnames() 
# Create feature plot for the numeric predictors against the target PH
featurePlot(train[, cols], 
            train$PH, 
            plot = "scatter")

nearZeroVar(train)
## [1] 13
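
The returned value is a column index, which is hard to interpret on its own; nearZeroVar() can also return names directly (this should flag Hyd_Pressure1, the 13th column):

# Report the near-zero-variance predictor by name rather than by index
nearZeroVar(train, names = TRUE)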

Data Preprocessing

The code below creates binary indicator columns for each possible value of Brand_Code (A, B, C, D). Any NA values in these indicator columns (code_a, code_b, code_c, code_d) are then replaced with 0, so records with a missing brand are coded as belonging to none of the four brands and no missing values remain in the indicators. The original Brand_Code column is then removed. Finally, we keep only the rows where PH is not NA, since rows without a valid target value cannot be used to train a model. (An equivalent encoding using caret's dummyVars() is sketched after the code.)

# Creating binary indicators for each brand category
train <- train %>%
  mutate(
    code_a = ifelse(Brand_Code == "A", 1, 0),
    code_b = ifelse(Brand_Code == "B", 1, 0),
    code_c = ifelse(Brand_Code == "C", 1, 0),
    code_d = ifelse(Brand_Code == "D", 1, 0)
  )
# Replacing any NA values in the binary columns with 0
train <- train %>%
  mutate(
    code_a = replace_na(code_a, 0),
    code_b = replace_na(code_b, 0),
    code_c = replace_na(code_c, 0),
    code_d = replace_na(code_d, 0)
  )
# Removing the Brand_Code column
train <- select(train, -Brand_Code)
# Keeping only rows with complete data in the 'PH' column
train <- train %>%
  filter(!is.na(PH))
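
For reference, the same encoding can be produced with caret's dummyVars(); here is a minimal sketch on a toy data frame. It also shows why the explicit replace_na() step above matters: dummyVars() propagates NA rather than mapping a missing brand to all zeros.

# Equivalent one-hot encoding via caret::dummyVars(), shown on a toy example
toy <- data.frame(Brand_Code = c("A", "B", NA, "D"))
dv <- dummyVars(~ Brand_Code, data = toy)
predict(dv, newdata = toy)  # one 0/1 column per brand level; the NA row stays NA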

Missing Value Imputation

The code below imputes missing values in the predictors using two different methods: bagged-tree imputation (bagImpute) and KNN imputation (knnImpute).

## missing value imputation
# Training Data
X_train <- subset(train,select = -`PH`)
X_train_df <- as.data.frame(X_train)
y_train <- train$PH
## Bag Imputation
preProc <- preProcess(X_train_df, method = c("bagImpute"))
X_train_imputed <- predict(preProc, X_train_df)
## KNN Imputation
preProc1 <- preProcess(X_train_df, method = c("knnImpute"))
X_train_imputed_2 <- predict(preProc1, X_train_df)
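
As a sanity check, neither imputed matrix should still contain missing values (note that caret's knnImpute also centers and scales the predictors as a side effect, while bagImpute leaves them on their original scale):

# Both counts should be zero after imputation
c(bag = sum(is.na(X_train_imputed)), knn = sum(is.na(X_train_imputed_2)))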

Train/Test Split

Next, we split each imputed dataset (X_train_imputed and X_train_imputed_2) into a training set and a hold-out development set, using an 80/20 split.

set.seed(1234)
## split each imputed dataset into train/dev-test sets (80/20)
X_train_imputed$y   <- y_train
X_train_imputed_2$y <- y_train
# sample.split() expects the target vector, so each row gets its own assignment
sample <- sample.split(X_train_imputed$y, SplitRatio = 0.8)
train_final <- subset(X_train_imputed, sample == TRUE)
dev_test <- subset(X_train_imputed, sample == FALSE)
sample_2 <- sample.split(X_train_imputed_2$y, SplitRatio = 0.8)
train_final_2 <- subset(X_train_imputed_2, sample_2 == TRUE)
dev_test_2 <- subset(X_train_imputed_2, sample_2 == FALSE)
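
A quick check of the resulting row counts; roughly 80% of the 2,567 rows with a valid PH should land in each training split:

# Confirm the 80/20 split sizes for both imputed versions
c(train = nrow(train_final), dev = nrow(dev_test))
c(train = nrow(train_final_2), dev = nrow(dev_test_2))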

Correlation Matrix

cor_matrix <- cor(X_train_imputed)
corrplot(cor_matrix,
         title = "Correlation Plot", method = "square", outline = TRUE,
         addgrid.col = "darkgray", order = "hclust", mar = c(4, 0, 4, 0),
         addrect = 4, rect.col = "black", rect.lwd = 5, cl.pos = "b",
         tl.col = "indianred4", tl.cex = 1, cl.cex = 1, type = "lower")
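
With 36 variables the plot is dense, so it may help to list the most strongly related pairs explicitly; a small sketch thresholding the matrix at |r| > 0.9:

# List variable pairs with absolute correlation above 0.9
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r = round(cor_matrix[high_cor], 3))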

KNN Model

# KNN
knn_model <- train(x = subset(train_final_2, select = -y),
                   y = train_final_2$y,
                   method = "knn",
                   tuneLength = 10)
knn_model
## k-Nearest Neighbors 
## 
## 1997 samples
##   35 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1997, 1997, 1997, 1997, 1997, 1997, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1329465  0.4399125  0.09703747
##    7  0.1302938  0.4529453  0.09661619
##    9  0.1297094  0.4543195  0.09700129
##   11  0.1295422  0.4541325  0.09748681
##   13  0.1298994  0.4505111  0.09816012
##   15  0.1301336  0.4481842  0.09863434
##   17  0.1305056  0.4447956  0.09921250
##   19  0.1309425  0.4410959  0.09983970
##   21  0.1314935  0.4362927  0.10029864
##   23  0.1319685  0.4323322  0.10095707
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knn_predictions <- predict(knn_model, newdata = subset(dev_test_2, select = -y))
knn_accuracy <- postResample(pred = knn_predictions, obs = dev_test_2$y)
knn_accuracy
##       RMSE   Rsquared        MAE 
## 0.11294418 0.56213054 0.08753535

Random Forest Model

## Random Forest
random_forest_model  <- randomForest(subset(train_final_2, select = -`y`), train_final_2$y)
random_forest_model 
## 
## Call:
##  randomForest(x = subset(train_final_2, select = -y), y = train_final_2$y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 0.01007175
##                     % Var explained: 66.43
rf_predictions  <- predict(random_forest_model, subset(dev_test_2, select = -`y`))
rf_accuracy <- postResample(rf_predictions , dev_test_2$y)
rf_accuracy
##       RMSE   Rsquared        MAE 
## 0.09058638 0.73403570 0.06782500

GBM Model

## GBM
gbm_parameter_grid  <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
gbm_tuning  <- train(y ~ ., data = train_final_2,
                 method = "gbm", 
                 tuneGrid = gbm_parameter_grid ,
                 verbose = FALSE)
gbm_predictions  <- predict(gbm_tuning , subset(dev_test_2, select = -y))
gbm_accuracy  <- postResample(gbm_predictions, dev_test_2$y)
gbm_accuracy 
##       RMSE   Rsquared        MAE 
## 0.10245067 0.64407540 0.07834692

PLS Model

Tune a PLS model using simple 10-fold cross-validation resampling:

set.seed(100)
plsTuned <- train(y ~ ., data = train_final_2,
                 method = "pls",tuneLength = 30,
                 trControl =  trainControl(method = "cv",number = 10),
                 preProc = c("center","scale"))
plsTuned
## Partial Least Squares 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1503426  0.2525026  0.1189976
##    2     0.1424523  0.3268504  0.1119158
##    3     0.1394648  0.3553281  0.1100331
##    4     0.1377914  0.3703116  0.1083002
##    5     0.1364919  0.3831704  0.1066698
##    6     0.1355944  0.3915996  0.1053951
##    7     0.1349298  0.3981059  0.1050866
##    8     0.1346732  0.4006188  0.1046466
##    9     0.1345273  0.4018981  0.1044203
##   10     0.1343412  0.4033304  0.1040158
##   11     0.1342041  0.4042434  0.1038793
##   12     0.1342310  0.4039503  0.1039904
##   13     0.1342257  0.4040724  0.1040268
##   14     0.1342422  0.4040293  0.1039981
##   15     0.1342801  0.4037919  0.1040619
##   16     0.1343214  0.4034121  0.1041119
##   17     0.1343345  0.4031636  0.1040913
##   18     0.1343817  0.4026604  0.1040865
##   19     0.1343621  0.4028022  0.1040402
##   20     0.1343945  0.4025858  0.1040724
##   21     0.1343819  0.4027516  0.1040356
##   22     0.1343738  0.4028550  0.1040125
##   23     0.1343301  0.4031859  0.1039783
##   24     0.1343183  0.4032771  0.1039521
##   25     0.1343084  0.4033492  0.1039451
##   26     0.1343002  0.4034023  0.1039419
##   27     0.1342889  0.4035037  0.1039452
##   28     0.1342875  0.4035176  0.1039452
##   29     0.1342811  0.4035659  0.1039402
##   30     0.1342795  0.4035769  0.1039386
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 11.

Make predictions on the hold-out set and evaluate PLS model performance:

pls_pred <- predict(plsTuned, newdata= subset(dev_test_2, select = -y))
pls_accuracy<-postResample(pred = pls_pred, obs = dev_test_2$y)
pls_accuracy
##      RMSE  Rsquared       MAE 
## 0.1287502 0.4265420 0.1009250

PCR Model

Tune a PCR model using simple 10-fold cross-validation resampling:

set.seed(100)
pcrTuned <- train(y ~ ., data = train_final_2,
                 method = "pcr",
                 tuneLength = 30,
                 trControl =  trainControl(method = "cv",number = 10),
                 preProc = c("center","scale"))
pcrTuned
## Principal Component Analysis 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared    MAE      
##    1     0.1718646  0.01762451  0.1371454
##    2     0.1586037  0.16720249  0.1258347
##    3     0.1570218  0.18294287  0.1244416
##    4     0.1570420  0.18290136  0.1241536
##    5     0.1571163  0.18188676  0.1241855
##    6     0.1512348  0.24234880  0.1191913
##    7     0.1478717  0.27454935  0.1169642
##    8     0.1479857  0.27353826  0.1170316
##    9     0.1462607  0.29116721  0.1158732
##   10     0.1460452  0.29339826  0.1158269
##   11     0.1452371  0.30077633  0.1150145
##   12     0.1451628  0.30149007  0.1150166
##   13     0.1450040  0.30320170  0.1149718
##   14     0.1446501  0.30701395  0.1147964
##   15     0.1435616  0.31784943  0.1141008
##   16     0.1438172  0.31606113  0.1141846
##   17     0.1428033  0.32530659  0.1131455
##   18     0.1417867  0.33345382  0.1117694
##   19     0.1420172  0.33123903  0.1119606
##   20     0.1400813  0.34812466  0.1107354
##   21     0.1390265  0.35843093  0.1095613
##   22     0.1390869  0.35802389  0.1096102
##   23     0.1391371  0.35798081  0.1098080
##   24     0.1381719  0.36828517  0.1087031
##   25     0.1379988  0.36968739  0.1084605
##   26     0.1369608  0.37894519  0.1072102
##   27     0.1370123  0.37889309  0.1072978
##   28     0.1370640  0.37843593  0.1073096
##   29     0.1345740  0.40114978  0.1043854
##   30     0.1344895  0.40165922  0.1044097
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 30.

Make predictions on the hold-out set and evaluate PCR model performance:

pcr_pred <- predict(pcrTuned, newdata= subset(dev_test_2, select = -y))
pcr_accuracy<-postResample(pred = pcr_pred, obs = dev_test_2$y)
pcr_accuracy
##      RMSE  Rsquared       MAE 
## 0.1292489 0.4220536 0.1014411

Neural Network Model

Tune Neural Network model:

# Get highly correlated columns and remove them from training and test data 
highly_correlated <- findCorrelation(cor(train_final_2),cutoff = .75)
train_Xnet <- train_final_2[,-highly_correlated]
test_Xnet <- dev_test_2[,-highly_correlated]

# Create candidate set of models to evaluate
nnetGrid <- expand.grid(.decay = c(0,0.01,0.1),.size = c(1:5),.bag = FALSE)

set.seed(100)
nnetTuned <- train(y ~ ., data = train_Xnet,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   trControl = trainControl(method = "cv", number = 10),
                   preProc = c("center", "scale"),
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 5 * (ncol(train_Xnet) + 1) + 5 + 1,
                   maxit = 250)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetTuned
## Model Averaged Neural Network 
## 
## 1997 samples
##   25 predictor
## 
## Pre-processing: centered (25), scaled (25) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE       
##   0.00   1     0.1466581  0.3035138  0.11577404
##   0.00   2     0.2005945  0.2863020  0.11812740
##   0.00   3     0.1362004  0.3938649  0.10400028
##   0.00   4     0.1291882  0.4448982  0.09858460
##   0.00   5     0.1464837  0.4334126  0.09920246
##   0.01   1     0.1422052  0.3288780  0.11265563
##   0.01   2     0.1339945  0.4046356  0.10502836
##   0.01   3     0.1285895  0.4505014  0.09909846
##   0.01   4     0.1245089  0.4851793  0.09567046
##   0.01   5     0.1247789  0.4830863  0.09520996
##   0.10   1     0.1429982  0.3212229  0.11350535
##   0.10   2     0.1344134  0.4006948  0.10587831
##   0.10   3     0.1286357  0.4512212  0.09959212
##   0.10   4     0.1274502  0.4612314  0.09776311
##   0.10   5     0.1254353  0.4779160  0.09602596
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.

Make predictions on the hold-out set and evaluate neural network model performance:

nnet_pred <- predict(nnetTuned, newdata= subset(test_Xnet, select = -y))
nnet_accuracy<-postResample(pred = nnet_pred, obs = test_Xnet$y)
nnet_accuracy
##       RMSE   Rsquared        MAE 
## 0.11698486 0.52905319 0.09214485

MARS Model

Tune a MARS model using 10-fold cross-validation:

set.seed(100)
marsgrid <- expand.grid(degree = 1:3,nprune = 2:30)

marsTuned<-train(y ~ ., data = train_final_2,
                 method = "earth",
                 tuneGrid = marsgrid,
                 trControl =  trainControl(method = "cv",number = 10),
                 preProc = c("center","scale"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE       
##   1        2      0.1529547  0.2239762  0.11997116
##   1        3      0.1451487  0.3016040  0.11384292
##   1        4      0.1432637  0.3202380  0.11184079
##   1        5      0.1416102  0.3360887  0.11023236
##   1        6      0.1401369  0.3503129  0.10836513
##   1        7      0.1379969  0.3701586  0.10678623
##   1        8      0.1364008  0.3851537  0.10559972
##   1        9      0.1356149  0.3920051  0.10461549
##   1       10      0.1329486  0.4153091  0.10325588
##   1       11      0.1327544  0.4164348  0.10228196
##   1       12      0.1326903  0.4171781  0.10221403
##   1       13      0.1323324  0.4202533  0.10201317
##   1       14      0.1320768  0.4227396  0.10170424
##   1       15      0.1316303  0.4268054  0.10116537
##   1       16      0.1318719  0.4251477  0.10127711
##   1       17      0.1316430  0.4274757  0.10110835
##   1       18      0.1315941  0.4280534  0.10091550
##   1       19      0.1315941  0.4280534  0.10091550
##   1       20      0.1315941  0.4280534  0.10091550
##   1       21      0.1315941  0.4280534  0.10091550
##   1       22      0.1315941  0.4280534  0.10091550
##   1       23      0.1315941  0.4280534  0.10091550
##   1       24      0.1315941  0.4280534  0.10091550
##   1       25      0.1315941  0.4280534  0.10091550
##   1       26      0.1315941  0.4280534  0.10091550
##   1       27      0.1315941  0.4280534  0.10091550
##   1       28      0.1315941  0.4280534  0.10091550
##   1       29      0.1315941  0.4280534  0.10091550
##   1       30      0.1315941  0.4280534  0.10091550
##   2        2      0.1534183  0.2195273  0.12039606
##   2        3      0.1466813  0.2856167  0.11535622
##   2        4      0.1456379  0.2973107  0.11408554
##   2        5      0.1425876  0.3267202  0.11109689
##   2        6      0.1406668  0.3472048  0.10882931
##   2        7      0.1414177  0.3438076  0.10938939
##   2        8      0.1393029  0.3625760  0.10679149
##   2        9      0.1382944  0.3710933  0.10563762
##   2       10      0.1376960  0.3771047  0.10543621
##   2       11      0.1370275  0.3839179  0.10464815
##   2       12      0.1357681  0.3951972  0.10336700
##   2       13      0.1354655  0.3991030  0.10275233
##   2       14      0.1339245  0.4119980  0.10165412
##   2       15      0.1316917  0.4305050  0.09989991
##   2       16      0.1307698  0.4401823  0.09879455
##   2       17      0.1292417  0.4529989  0.09790942
##   2       18      0.1288209  0.4561648  0.09759841
##   2       19      0.1286744  0.4575672  0.09726496
##   2       20      0.1288687  0.4568837  0.09739498
##   2       21      0.1281908  0.4625339  0.09667287
##   2       22      0.1280102  0.4638934  0.09646312
##   2       23      0.1274060  0.4685275  0.09621850
##   2       24      0.1266720  0.4739322  0.09578023
##   2       25      0.1263278  0.4763104  0.09547193
##   2       26      0.1261539  0.4777784  0.09540930
##   2       27      0.1262246  0.4772829  0.09536583
##   2       28      0.1261877  0.4777085  0.09539824
##   2       29      0.1260255  0.4788105  0.09530204
##   2       30      0.1257873  0.4806389  0.09513345
##   3        2      0.1529547  0.2239762  0.11997116
##   3        3      0.1456463  0.2963680  0.11424491
##   3        4      0.1427016  0.3237116  0.11098700
##   3        5      0.1405485  0.3450519  0.10917268
##   3        6      0.1384343  0.3644972  0.10710921
##   3        7      0.1373972  0.3728565  0.10557776
##   3        8      0.1359429  0.3864454  0.10435467
##   3        9      0.1349495  0.3958971  0.10369222
##   3       10      0.1338498  0.4057464  0.10263162
##   3       11      0.1324577  0.4185147  0.10118773
##   3       12      0.1317300  0.4257988  0.10097386
##   3       13      0.1310221  0.4327902  0.09990873
##   3       14      0.1284114  0.4534883  0.09804658
##   3       15      0.1292054  0.4487715  0.09818156
##   3       16      0.1269489  0.4667162  0.09681454
##   3       17      0.1262097  0.4725133  0.09576426
##   3       18      0.1250709  0.4816390  0.09470195
##   3       19      0.1245805  0.4860489  0.09405047
##   3       20      0.1247285  0.4850913  0.09405229
##   3       21      0.1249612  0.4841608  0.09409856
##   3       22      0.1249451  0.4842656  0.09411946
##   3       23      0.1248127  0.4855871  0.09388316
##   3       24      0.1246949  0.4869685  0.09362219
##   3       25      0.1250028  0.4864160  0.09371541
##   3       26      0.1249981  0.4869485  0.09386094
##   3       27      0.1245131  0.4908756  0.09351394
##   3       28      0.1243361  0.4927872  0.09355877
##   3       29      0.1238487  0.4967387  0.09299816
##   3       30      0.1234745  0.4997570  0.09257993
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 30 and degree = 3.

Make predictions on the hold-out set and evaluate MARS model performance:

mars_pred <- predict(marsTuned, newdata= subset(dev_test_2, select = -y))
mars_accuracy<-postResample(pred = mars_pred, obs = dev_test_2$y)
mars_accuracy
##       RMSE   Rsquared        MAE 
## 0.11718086 0.52620674 0.09066281

SVM Model

Tune SVM model:

set.seed(100)
svmTuned<-train(x = subset(train_final_2, select = -y),
                  y = train_final_2$y ,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE       
##      0.25  0.1268100  0.4758660  0.09398379
##      0.50  0.1235019  0.4997336  0.09073704
##      1.00  0.1207803  0.5197304  0.08845277
##      2.00  0.1184667  0.5364172  0.08683304
##      4.00  0.1164352  0.5521370  0.08520787
##      8.00  0.1169365  0.5508894  0.08588098
##     16.00  0.1181413  0.5470169  0.08738437
##     32.00  0.1216116  0.5299452  0.09007273
##     64.00  0.1283433  0.4950676  0.09495553
##    128.00  0.1356823  0.4625062  0.10017520
##    256.00  0.1414263  0.4410341  0.10440948
##    512.00  0.1465595  0.4188510  0.10827227
##   1024.00  0.1489228  0.4092491  0.11028641
##   2048.00  0.1489228  0.4092491  0.11028641
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01971348
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01971348 and C = 4.

Make predictions on the hold-out set and evaluate SVM model performance:

svm_pred <- predict(svmTuned, newdata= subset(dev_test_2, select = -y))
svm_accuracy<-postResample(pred = svm_pred, obs = dev_test_2$y)
svm_accuracy
##       RMSE   Rsquared        MAE 
## 0.10956812 0.58456969 0.08225389

Recursive Partitioning Decision Tree

Tune Recursive Partitioning Decision Tree Model:

set.seed(100)
rpartTune <- train(x = subset(train_final_2, select = -y),
                   y = train_final_2$y,
                   method = "rpart2",
                   preProc = c("center", "scale"),
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
rpartTune
## CART 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE       Rsquared   MAE      
##    1        0.1529747  0.2237745  0.1199919
##    2        0.1472369  0.2813990  0.1157326
##    3        0.1431181  0.3214345  0.1133946
##    4        0.1397949  0.3514484  0.1095457
##    5        0.1389555  0.3587993  0.1086642
##    6        0.1386227  0.3616622  0.1081520
##    7        0.1378278  0.3694417  0.1069444
##    8        0.1371662  0.3758798  0.1065626
##   10        0.1356283  0.3910793  0.1052888
##   12        0.1298444  0.4414623  0.1004344
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 12.

Make predictions on the hold-out set and evaluate performance of the recursive partitioning decision tree model:

rpart_pred <- predict(rpartTune, newdata= subset(dev_test_2, select = -y))
rpart_accuracy<-postResample(pred=rpart_pred, obs = dev_test_2$y)
rpart_accuracy
##       RMSE   Rsquared        MAE 
## 0.12291123 0.47715548 0.09575514

Bagged Trees

Fit a bagged trees model (ipredbagg() is fit with its defaults; no tuning grid is used):

set.seed(100)

# Fit a bagged regression tree ensemble
baggedTree <- ipredbagg(train_final_2$y, subset(train_final_2, select = -y))

bagged_Pred <- predict(baggedTree, newdata = subset(dev_test_2, select = -y))

# Evaluate the bagged-tree predictions on the hold-out set
baggedTree_accuracy <- postResample(pred = bagged_Pred, obs = dev_test_2$y)

baggedTree_accuracy
##       RMSE   Rsquared        MAE 
## 0.12291123 0.47715548 0.09575514

Cubist model

Tune Cubist Model:

cubistGrid <- expand.grid(committees = c(1, 5, 10),
                          neighbors = c(0, 1, 3, 5))

set.seed(100)

cubistTune <- train(x = subset(train_final_2, select = -y),
                    y = train_final_2$y,
                    method = "cubist",
                    preProc = c("center", "scale"),
                    tuneGrid = cubistGrid,
                    verbose = FALSE)

cubistTune
## Cubist 
## 
## 1997 samples
##   35 predictor
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1997, 1997, 1997, 1997, 1997, 1997, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE       
##    1          0          0.1551279  0.3711975  0.10266237
##    1          1          0.1611767  0.3879031  0.10643571
##    1          3          0.1555072  0.3966772  0.10231619
##    1          5          0.1541102  0.3972435  0.10129013
##    5          0          0.1184657  0.5386691  0.08368411
##    5          1          0.1215098  0.5423594  0.08496936
##    5          3          0.1174746  0.5575404  0.08236944
##    5          5          0.1167658  0.5591217  0.08202837
##   10          0          0.1104652  0.5905058  0.07850673
##   10          1          0.1131453  0.5874095  0.07910953
##   10          3          0.1093231  0.6050411  0.07683539
##   10          5          0.1085769  0.6080797  0.07656626
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.

Make predictions on the hold-out set and evaluate performance of the Cubist model:

cubist_pred <- predict(cubistTune, newdata= subset(dev_test_2, select = -y))
cubist_accuracy<-postResample(pred=cubist_pred, obs = dev_test_2$y)
cubist_accuracy
##       RMSE   Rsquared        MAE 
## 0.08878050 0.72764140 0.06655013

Combine Results of Different Models into a Single Table

# Combine results of all models into a single table
all_results<- rbind(
  knn=knn_accuracy,
  rf=rf_accuracy,
  gbm=gbm_accuracy,
  pls=pls_accuracy,
  pcr=pcr_accuracy,
  nnet=nnet_accuracy,
  mars=mars_accuracy,
  svm=svm_accuracy,
  rpart = rpart_accuracy,
  baggedTree=baggedTree_accuracy,
  cubist= cubist_accuracy
)

# Convert to a data frame 
results <- as.data.frame(all_results)

# See results
print(results)
##                  RMSE  Rsquared        MAE
## knn        0.11294418 0.5621305 0.08753535
## rf         0.09058638 0.7340357 0.06782500
## gbm        0.10245067 0.6440754 0.07834692
## pls        0.12875018 0.4265420 0.10092503
## pcr        0.12924892 0.4220536 0.10144108
## nnet       0.11698486 0.5290532 0.09214485
## mars       0.11718086 0.5262067 0.09066281
## svm        0.10956812 0.5845697 0.08225389
## rpart      0.12291123 0.4771555 0.09575514
## baggedTree 0.12291123 0.4771555 0.09575514
## cubist     0.08878050 0.7276414 0.06655013

The Cubist model has a slightly lower RMSE (0.0888) than the Random Forest model (0.0906), indicating slightly better predictive accuracy. However, the Random Forest model has a higher R-squared (0.7340 vs. 0.7276), meaning it explains more of the variance in PH.

Given how small these differences are, we select Random Forest as our final model: it offers the best overall balance of predictive accuracy and explanatory power.
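
To make the comparison easier to scan, the same table can be sorted by hold-out RMSE, best model first:

# Rank models by hold-out RMSE (lowest first)
results[order(results$RMSE), ]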

Plot Predictor Importance Scores for Random Forest Model

imp_scores <- varImp(random_forest_model)
imp_scores |>
  mutate(var = rownames(imp_scores)) |>
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = 'blue') +
  labs(title = 'Predictor Importance', y = 'Predictors')

Get the Top 10 Most Important Predictors for the Random Forest Model

imp_scores <- varImp(random_forest_model)
top10 <- head(imp_scores[order(-imp_scores$Overall), , drop = FALSE], 10)
top10
##                  Overall
## Mnf_Flow        7.378791
## Usage_cont      4.399368
## code_c          2.878830
## Filler_Level    2.723815
## Oxygen_Filler   2.706484
## Bowl_Setpoint   2.613890
## Temperature     2.600523
## Carb_Rel        2.241133
## Pressure_Vacuum 2.158924
## Balling_Lvl     2.125748
# Get correlation matrix
corr_matrix <- cor(select(train, c("PH", row.names(top10))), use = "complete.obs")
corr_matrix
##                         PH     Mnf_Flow   Usage_cont       code_c Filler_Level
## PH               1.0000000 -0.447449397 -0.313766476 -0.295535701   0.32653555
## Mnf_Flow        -0.4474494  1.000000000  0.519156997 -0.001914574  -0.57876422
## Usage_cont      -0.3137665  0.519156997  1.000000000  0.008684086  -0.35034791
## code_c          -0.2955357 -0.001914574  0.008684086  1.000000000   0.05317548
## Filler_Level     0.3265356 -0.578764217 -0.350347913  0.053175479   1.00000000
## Oxygen_Filler    0.1637180 -0.536994185 -0.306667782  0.037792339   0.22825209
## Bowl_Setpoint    0.3492594 -0.580000493 -0.364109851  0.052668762   0.94948508
## Temperature     -0.1820321 -0.071607350 -0.102012924  0.198846869   0.06534243
## Carb_Rel         0.1669034 -0.029104955 -0.030272353 -0.240792365   0.10562619
## Pressure_Vacuum  0.2199029 -0.529363084 -0.300989266 -0.060884498   0.33204757
## Balling_Lvl      0.1098807  0.031960803  0.037975706 -0.221198022   0.04619991
##                 Oxygen_Filler Bowl_Setpoint Temperature     Carb_Rel
## PH                 0.16371802    0.34925942 -0.18203206  0.166903383
## Mnf_Flow          -0.53699419   -0.58000049 -0.07160735 -0.029104955
## Usage_cont        -0.30666778   -0.36410985 -0.10201292 -0.030272353
## code_c             0.03779234    0.05266876  0.19884687 -0.240792365
## Filler_Level       0.22825209    0.94948508  0.06534243  0.105626192
## Oxygen_Filler      1.00000000    0.23105759  0.13322643 -0.010143824
## Bowl_Setpoint      0.23105759    1.00000000  0.04208445  0.127269033
## Temperature        0.13322643    0.04208445  1.00000000 -0.106773288
## Carb_Rel          -0.01014382    0.12726903 -0.10677329  1.000000000
## Pressure_Vacuum    0.22875128    0.34540601  0.03572839 -0.004106525
## Balling_Lvl       -0.05128338    0.06468899 -0.17939526  0.850528168
##                 Pressure_Vacuum Balling_Lvl
## PH                  0.219902868  0.10988070
## Mnf_Flow           -0.529363084  0.03196080
## Usage_cont         -0.300989266  0.03797571
## code_c             -0.060884498 -0.22119802
## Filler_Level        0.332047572  0.04619991
## Oxygen_Filler       0.228751276 -0.05128338
## Bowl_Setpoint       0.345406013  0.06468899
## Temperature         0.035728385 -0.17939526
## Carb_Rel           -0.004106525  0.85052817
## Pressure_Vacuum     1.000000000 -0.04022344
## Balling_Lvl        -0.040223438  1.00000000
# Plot correlation matrix
corrplot(corr_matrix, method = "circle",mar = c(0,0,3, 0)) 
title ("Correlation between pH and top 10 predictors")

Testing the Random Forest Prediction Model on the Test Data

Below we apply the final Random Forest model to predict the PH values in the evaluation dataset (test), where PH is entirely missing. We first preprocess the test dataset to match the structure of the training data and impute the missing predictor values. The empty PH column is temporarily removed so that the columns match the predictors used in training. The trained Random Forest model is then applied to generate PH predictions, which are appended to the dataset as a new column. Finally, the processed test dataset, including the predicted PH values, is exported to CSV and Excel files.

test_1 <- test
#Match the column names with training data
colnames(test_1) <- sub(" ", "_", colnames(test_1))
# Preprocess the test dataset to match the training dataset format
test_1 <- test_1 %>%
  mutate(
    code_a = ifelse(Brand_Code == "A", 1, 0),
    code_b = ifelse(Brand_Code == "B", 1, 0),
    code_c = ifelse(Brand_Code == "C", 1, 0),
    code_d = ifelse(Brand_Code == "D", 1, 0)
  ) %>%
  mutate(
    code_a = replace_na(code_a, 0),
    code_b = replace_na(code_b, 0),
    code_c = replace_na(code_c, 0),
    code_d = replace_na(code_d, 0)
  ) %>%
  select(-Brand_Code)
test_1 <- as.data.frame(test_1)
# Impute missing predictor values in the test dataset using mean imputation
# (a simple stand-in here for the bagged imputation used on the training data)
test_imputed <- test_1 %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Drop the (empty) PH column so the predictors match the training data
test_imputed <- test_imputed %>% select(-PH)
# Predict PH values using the trained random forest model
test_imputed$PH <- predict(random_forest_model, newdata = test_imputed)
# View the predicted PH values
head(test_imputed$PH)
## [1] 8.558511 8.561617 8.573331 8.573890 8.575387 8.574257
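
As a sanity check, the predicted PH values should fall within the range of the observed training PH (7.88 to 9.36 per the summary statistics above); a quick comparison:

# Compare the spread of the predictions with the observed training PH
summary(test_imputed$PH)  # predicted PH on the evaluation set
summary(y_train)          # observed PH in the training data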

From here we can insert the predicted PH values into the original test dataset and export it as an Excel file.

test_reverted <- test %>%
  mutate(PH = test_imputed$PH) # Fill the empty PH column with the predicted values
# View the reverted dataset
tail(test_reverted)
## # A tibble: 6 × 33
##   Brand_Code Carb_Volume Fill_Ounces PC_Volume Carb_Pressure Carb_Temp   PSC
##   <chr>            <dbl>       <dbl>     <dbl>         <dbl>     <dbl> <dbl>
## 1 D                 5.51        24.1     0.193          77        149. 0.096
## 2 B                 5.39        24.0     0.223          66.4      138. 0.096
## 3 D                 5.57        24.1     0.234          62.8      130  0.054
## 4 C                 5.42        24.0     0.273          62.6      133. 0.128
## 5 C                 5.25        24.1     0.361          68.2      145. 0.06 
## 6 C                 5.22        24.0     0.225          64.4      140  0.126
## # ℹ 26 more variables: PSC_Fill <dbl>, PSC_CO2 <dbl>, Mnf_Flow <dbl>,
## #   Carb_Pressure1 <dbl>, Fill_Pressure <dbl>, Hyd_Pressure1 <dbl>,
## #   Hyd_Pressure2 <dbl>, Hyd_Pressure3 <dbl>, Hyd_Pressure4 <dbl>,
## #   Filler_Level <dbl>, Filler_Speed <dbl>, Temperature <dbl>,
## #   Usage_cont <dbl>, Carb_Flow <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## #   Pressure_Vacuum <dbl>, PH <dbl>, Oxygen_Filler <dbl>, Bowl_Setpoint <dbl>,
## #   Pressure_Setpoint <dbl>, Air_Pressurer <dbl>, Alch_Rel <dbl>, …
# Export the dataset with predictions to CSV and Excel files
write.csv(test_reverted, "test_final2.csv", row.names = FALSE)
write_xlsx(test_reverted, "test_final2.xlsx")

Plotting Accuracy Comparisons

To see how well the model performs, we can compare the predicted PH values with the known PH values from the training data.

# Combine predictions from training and testing for comparison
train_final_2$Prediction <- predict(random_forest_model, newdata = train_final_2)
# Enhanced plot for observed vs predicted for training data
ggplot(data = train_final_2, aes(x = y, y = Prediction)) +
  geom_point(color = "blue", alpha = 0.3, size = 1.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Observed vs Predicted (Training Data)",
       x = "Observed PH",
       y = "Predicted PH") +
  theme_minimal() +
  theme(panel.grid.major = element_line(color = "gray", linetype = "dotted"),
        plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 12))

The "Observed vs Predicted (Training Data)" plot shows a strong positive relationship: most points lie close to the diagonal red dashed line, which represents perfect prediction. This alignment suggests that the Random Forest model has effectively captured the relationship between the features and the target variable PH in the training data. Slight deviations appear at the extremes of the range, indicating minor inaccuracies in those regions, and the clustering of points around specific PH values suggests the training data contains more examples in those ranges, yielding better predictions there. Because these are in-sample predictions, the plot overstates accuracy relative to new data; the hold-out metrics reported above give a more realistic estimate of performance.
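
To probe the deviations at the extremes, a residual plot can complement the scatter above; a minimal sketch using the same in-sample predictions:

# Residuals vs observed PH; a systematic trend at the extremes would appear here
train_final_2 %>%
  mutate(residual = y - Prediction) %>%
  ggplot(aes(x = y, y = residual)) +
  geom_point(color = "blue", alpha = 0.3) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs Observed PH (Training Data)",
       x = "Observed PH",
       y = "Residual (Observed - Predicted)") +
  theme_minimal()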