Instructions Project #2 (Team) Assignment
This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tsibble 1.1.5 ✔ feasts 0.4.1
## ✔ tsibbledata 0.4.1 ✔ fable 0.4.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:fabletools':
##
## MAE, RMSE
##
## The following object is masked from 'package:purrr':
##
## lift
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:fabletools':
##
## interpolate
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## Registered S3 method overwritten by 'inum':
## method from
## format.interval tsibble
## corrplot 0.95 loaded
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loading required package: iterators
## Loading required package: parallel
# Define the URL for the raw Excel file (replace with the correct raw URL)
file_url <- "https://raw.githubusercontent.com/Naik-Khyati/data_624/main/p2/StudentData.xlsx"
# Download the file to a temporary location
temp_file <- tempfile(fileext = ".xlsx")
download.file(file_url, destfile = temp_file, mode = "wb")
# Read the Excel file
train <- read_excel(temp_file)
# View the data
print(train)## # A tibble: 2,571 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 B 5.34 24.0 0.263 68.2
## 2 A 5.43 24.0 0.239 68.4
## 3 B 5.29 24.1 0.263 70.8
## 4 A 5.44 24.0 0.293 63
## 5 A 5.49 24.3 0.111 67.2
## 6 A 5.38 23.9 0.269 66.6
## 7 A 5.31 23.9 0.268 64.2
## 8 B 5.32 24.2 0.221 67.6
## 9 B 5.25 24.0 0.263 64.2
## 10 B 5.27 24.0 0.231 72
## # ℹ 2,561 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
There are 2571 records and 33 columns in the student data (training data).
# Define the URL for the raw Excel file (replace with the correct raw URL)
file_url <- "https://raw.githubusercontent.com/Naik-Khyati/data_624/main/p2/StudentEvaluation.xlsx"
# Download the file to a temporary location
temp_file <- tempfile(fileext = ".xlsx")
download.file(file_url, destfile = temp_file, mode = "wb")
# Read the Excel file
test <- read_excel(temp_file)
# View the data
print(test)## # A tibble: 267 × 33
## `Brand Code` `Carb Volume` `Fill Ounces` `PC Volume` `Carb Pressure`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 D 5.48 24.0 0.27 65.4
## 2 A 5.39 24.0 0.227 63.2
## 3 B 5.29 23.9 0.303 66.4
## 4 B 5.27 23.9 0.186 64.8
## 5 B 5.41 24.2 0.16 69.4
## 6 B 5.29 24.1 0.212 73.4
## 7 A 5.48 23.9 0.243 65.2
## 8 B 5.42 24.1 0.123 67.4
## 9 A 5.41 23.9 0.333 66.8
## 10 D 5.47 24.0 0.256 72.6
## # ℹ 257 more rows
## # ℹ 28 more variables: `Carb Temp` <dbl>, PSC <dbl>, `PSC Fill` <dbl>,
## # `PSC CO2` <dbl>, `Mnf Flow` <dbl>, `Carb Pressure1` <dbl>,
## # `Fill Pressure` <dbl>, `Hyd Pressure1` <dbl>, `Hyd Pressure2` <dbl>,
## # `Hyd Pressure3` <dbl>, `Hyd Pressure4` <dbl>, `Filler Level` <dbl>,
## # `Filler Speed` <dbl>, Temperature <dbl>, `Usage cont` <dbl>,
## # `Carb Flow` <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>, …
There are 267 records and 33 columns in the student evaluation data (test data).
# Formatting Names to be more readable
colnames(train) <- sub(" ", "_", colnames(train))
paste("There are:", sum(is.na(train)), "total missing values in training data")## [1] "There are: 844 total missing values in training data"
## Brand_Code Carb_Volume Fill_Ounces PC_Volume
## 120 10 38 39
## Carb_Pressure Carb_Temp PSC PSC_Fill
## 27 26 33 23
## PSC_CO2 Mnf_Flow Carb_Pressure1 Fill_Pressure
## 39 2 32 22
## Hyd_Pressure1 Hyd_Pressure2 Hyd_Pressure3 Hyd_Pressure4
## 11 15 15 30
## Filler_Level Filler_Speed Temperature Usage_cont
## 20 57 14 5
## Carb_Flow Density MFR Balling
## 2 1 212 1
## Pressure_Vacuum PH Oxygen_Filler Bowl_Setpoint
## 0 4 12 2
## Pressure_Setpoint Air_Pressurer Alch_Rel Carb_Rel
## 12 0 9 10
## Balling_Lvl
## 1
colnames(test) <- sub(" ", "_", colnames(test))
paste("There are:", sum(is.na(test)), "total missing values in test data")## [1] "There are: 374 total missing values in test data"
## Brand_Code Carb_Volume Fill_Ounces PC_Volume
## 8 1 6 4
## Carb_Pressure Carb_Temp PSC PSC_Fill
## 0 1 5 3
## PSC_CO2 Mnf_Flow Carb_Pressure1 Fill_Pressure
## 5 0 4 2
## Hyd_Pressure1 Hyd_Pressure2 Hyd_Pressure3 Hyd_Pressure4
## 0 1 1 4
## Filler_Level Filler_Speed Temperature Usage_cont
## 2 10 2 2
## Carb_Flow Density MFR Balling
## 0 1 31 1
## Pressure_Vacuum PH Oxygen_Filler Bowl_Setpoint
## 1 267 3 1
## Pressure_Setpoint Air_Pressurer Alch_Rel Carb_Rel
## 2 1 3 2
## Balling_Lvl
## 0
In the training data set, Brand_Code has 120 missing records. MFR has 212 missing records. There are 844 total missing values in training data.
Below code provides summary statistics:
reshape_data <- train %>%
pivot_longer(!`Brand_Code`,names_to = "vars", values_to = "value")
summary_stats <- as.data.frame(reshape_data) %>%
group_by(vars) %>%
summarise(min = min(value, na.rm = TRUE),
max = max(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE),
median = median(value, na.rm = TRUE),
std = sd(value, na.rm = TRUE),
var = var(value, na.rm = TRUE))
print(summary_stats,n = 32)## # A tibble: 32 × 7
## vars min max mean median std var
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Air_Pressurer 141. 148. 143. 143. 1.21 1.47e+0
## 2 Alch_Rel 5.28 8.62 6.90 6.56 0.505 2.55e-1
## 3 Balling -0.17 4.01 2.20 1.65 0.931 8.67e-1
## 4 Balling_Lvl 0 3.66 2.05 1.48 0.870 7.57e-1
## 5 Bowl_Setpoint 70 140 109. 120 15.3 2.34e+2
## 6 Carb_Flow 26 5104 2468. 3028 1074. 1.15e+6
## 7 Carb_Pressure 57 79.4 68.2 68.2 3.54 1.25e+1
## 8 Carb_Pressure1 106. 140. 123. 123. 4.74 2.25e+1
## 9 Carb_Rel 4.96 6.06 5.44 5.4 0.129 1.66e-2
## 10 Carb_Temp 129. 154 141. 141. 4.04 1.63e+1
## 11 Carb_Volume 5.04 5.7 5.37 5.35 0.106 1.13e-2
## 12 Density 0.24 1.92 1.17 0.98 0.378 1.43e-1
## 13 Fill_Ounces 23.6 24.3 24.0 24.0 0.0875 7.66e-3
## 14 Fill_Pressure 34.6 60.4 47.9 46.4 3.18 1.01e+1
## 15 Filler_Level 55.8 161. 109. 118. 15.7 2.46e+2
## 16 Filler_Speed 998 4030 3687. 3982 771. 5.94e+5
## 17 Hyd_Pressure1 -0.8 58 12.4 11.4 12.4 1.55e+2
## 18 Hyd_Pressure2 0 59.4 21.0 28.6 16.4 2.69e+2
## 19 Hyd_Pressure3 -1.2 50 20.5 27.6 16.0 2.55e+2
## 20 Hyd_Pressure4 52 142 96.3 96 13.1 1.72e+2
## 21 MFR 31.4 869. 704. 724 73.9 5.46e+3
## 22 Mnf_Flow -100. 229. 24.6 65.2 119. 1.43e+4
## 23 Oxygen_Filler 0.0024 0.4 0.0468 0.0334 0.0466 2.18e-3
## 24 PC_Volume 0.0793 0.478 0.277 0.271 0.0607 3.68e-3
## 25 PH 7.88 9.36 8.55 8.54 0.173 2.98e-2
## 26 PSC 0.002 0.27 0.0846 0.076 0.0493 2.43e-3
## 27 PSC_CO2 0 0.24 0.0564 0.04 0.0430 1.85e-3
## 28 PSC_Fill 0 0.62 0.195 0.18 0.118 1.39e-2
## 29 Pressure_Setpoint 44 52 47.6 46 2.04 4.16e+0
## 30 Pressure_Vacuum -6.6 -3.6 -5.22 -5.4 0.570 3.25e-1
## 31 Temperature 63.6 76.2 66.0 65.6 1.38 1.91e+0
## 32 Usage_cont 12.1 25.9 21.0 21.8 2.98 8.87e+0
# Generate density plot for each variable in the reshaped data
ggplot(reshape_data, aes(x = value)) +
geom_density(fill = "skyblue", alpha = 0.5) +
facet_wrap(~ vars, scales = "free") + # Creates a separate plot for each variable
theme_minimal() +
labs(title = "Density Plots for Observed Values by Variable",
x = "Observed Value",
y = "Density")## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_density()`).
# Select numeric predictor columns from the train dataframe
cols <- train %>%
select(-c('Brand_Code', 'PH')) %>% # Remove non-numeric and target variables
colnames()
# Create feature plot for the numeric predictors against the target PH
featurePlot(train[, cols],
train$PH,
plot = "scatter")## [1] 13
Below code creates binary indicator columns for each possible value of Brand_Code (A, B, C, D). Furthermore, we ensure that if there are any NA (missing) values in the binary columns (code_a, code_b, code_c, code_d), they are replaced with 0. This ensures that no missing values remain in these indicator columns, making the data cleaner for analysis. Finally, we remove the original Brand_Code column from the dataset. Next, we Keeps only the rows where the PH column does not have NA values. This ensures that any rows without a valid target value (PH) are excluded, which is important for training a model where the target variable is required.
# Creating binary indicators for each brand category
train <- train %>%
mutate(
code_a = ifelse(Brand_Code == "A", 1, 0),
code_b = ifelse(Brand_Code == "B", 1, 0),
code_c = ifelse(Brand_Code == "C", 1, 0),
code_d = ifelse(Brand_Code == "D", 1, 0)
)
# Replacing any NA values in the binary columns with 0
train <- train %>%
mutate(
code_a = replace_na(code_a, 0),
code_b = replace_na(code_b, 0),
code_c = replace_na(code_c, 0),
code_d = replace_na(code_d, 0)
)
# Removing the Brand_Code column
train <- select(train, -Brand_Code)
# Keeping only rows with complete data in the 'PH' column
train <- train %>%
filter(!is.na(PH))Below code imputes missing values in a dataset using two different methods: Bagging Imputation and KNN Imputation.
## missing value imputation
# Training Data
X_train <- subset(train,select = -`PH`)
X_train_df <- as.data.frame(X_train)
y_train <- train$PH
## Bag Imputation
preProc <- preProcess(X_train_df, method = c("bagImpute"))
X_train_imputed <- predict(preProc, X_train_df)
## KNN Imputation
preProc1 <- preProcess(X_train_df, method = c("knnImpute"))
X_train_imputed_2 <- predict(preProc1, X_train_df)Next, we shall split the imputed datasets (X_train_imputed and X_train_imputed_2).
set.seed(1234)
## split imputed data into train/test
X_train_imputed$y <- y_train
X_train_imputed_2$y <- y_train
sample <- sample.split(X_train_imputed, SplitRatio = 0.8)
train_final <- subset(X_train_imputed, sample == TRUE)
dev_test <- subset(X_train_imputed, sample == FALSE)
sample_2 <- sample.split(X_train_imputed_2, SplitRatio = 0.8)
train_final_2 <- subset(X_train_imputed_2, sample == TRUE)
dev_test_2 <- subset(X_train_imputed_2, sample == FALSE)windows.options(width = 40, height = 40)
cor_matrix <- cor(X_train_imputed)
corrplot(cor_matrix,title = "Correlation Plot", method = "square", outline = T, addgrid.col = "darkgray", order="hclust", mar = c(4,0,4,0), addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1, cl.cex = 1, type = "lower")# KNN
knn_model <- train(x = subset(train_final_2, select = -y),
y = train_final_2$y ,
method = "knn",
tuneLength = 10)
knn_model ## k-Nearest Neighbors
##
## 1997 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1997, 1997, 1997, 1997, 1997, 1997, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.1329465 0.4399125 0.09703747
## 7 0.1302938 0.4529453 0.09661619
## 9 0.1297094 0.4543195 0.09700129
## 11 0.1295422 0.4541325 0.09748681
## 13 0.1298994 0.4505111 0.09816012
## 15 0.1301336 0.4481842 0.09863434
## 17 0.1305056 0.4447956 0.09921250
## 19 0.1309425 0.4410959 0.09983970
## 21 0.1314935 0.4362927 0.10029864
## 23 0.1319685 0.4323322 0.10095707
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
knn_predictions <- predict(knn_model , newdata= subset(dev_test_2, select = -y))
knn_accuracy <- postResample(pred = knn_predictions , obs = dev_test_2$y)
knn_accuracy ## RMSE Rsquared MAE
## 0.11294418 0.56213054 0.08753535
## Random Forest
random_forest_model <- randomForest(subset(train_final_2, select = -`y`), train_final_2$y)
random_forest_model ##
## Call:
## randomForest(x = subset(train_final_2, select = -y), y = train_final_2$y)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 0.01007175
## % Var explained: 66.43
rf_predictions <- predict(random_forest_model, subset(dev_test_2, select = -`y`))
rf_accuracy <- postResample(rf_predictions , dev_test_2$y)
rf_accuracy## RMSE Rsquared MAE
## 0.09058638 0.73403570 0.06782500
## GBM
gbm_parameter_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
gbm_tuning <- train(y ~ ., data = train_final_2,
method = "gbm",
tuneGrid = gbm_parameter_grid ,
verbose = FALSE)gbm_predictions <- predict(gbm_tuning , subset(dev_test_2, select = -y))
gbm_accuracy <- postResample(gbm_predictions, dev_test_2$y)
gbm_accuracy ## RMSE Rsquared MAE
## 0.10245067 0.64407540 0.07834692
Tune a PLS model using a simple 10-fold cross validation resampling:
set.seed(100)
plsTuned <- train(y ~ ., data = train_final_2,
method = "pls",tuneLength = 30,
trControl = trainControl(method = "cv",number = 10),
preProc = c("center","scale"))
plsTuned## Partial Least Squares
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1503426 0.2525026 0.1189976
## 2 0.1424523 0.3268504 0.1119158
## 3 0.1394648 0.3553281 0.1100331
## 4 0.1377914 0.3703116 0.1083002
## 5 0.1364919 0.3831704 0.1066698
## 6 0.1355944 0.3915996 0.1053951
## 7 0.1349298 0.3981059 0.1050866
## 8 0.1346732 0.4006188 0.1046466
## 9 0.1345273 0.4018981 0.1044203
## 10 0.1343412 0.4033304 0.1040158
## 11 0.1342041 0.4042434 0.1038793
## 12 0.1342310 0.4039503 0.1039904
## 13 0.1342257 0.4040724 0.1040268
## 14 0.1342422 0.4040293 0.1039981
## 15 0.1342801 0.4037919 0.1040619
## 16 0.1343214 0.4034121 0.1041119
## 17 0.1343345 0.4031636 0.1040913
## 18 0.1343817 0.4026604 0.1040865
## 19 0.1343621 0.4028022 0.1040402
## 20 0.1343945 0.4025858 0.1040724
## 21 0.1343819 0.4027516 0.1040356
## 22 0.1343738 0.4028550 0.1040125
## 23 0.1343301 0.4031859 0.1039783
## 24 0.1343183 0.4032771 0.1039521
## 25 0.1343084 0.4033492 0.1039451
## 26 0.1343002 0.4034023 0.1039419
## 27 0.1342889 0.4035037 0.1039452
## 28 0.1342875 0.4035176 0.1039452
## 29 0.1342811 0.4035659 0.1039402
## 30 0.1342795 0.4035769 0.1039386
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 11.
Make prediction on test data and evaluate PLS model performance:
pls_pred <- predict(plsTuned, newdata= subset(dev_test_2, select = -y))
pls_accuracy<-postResample(pred = pls_pred, obs = dev_test_2$y)
pls_accuracy## RMSE Rsquared MAE
## 0.1287502 0.4265420 0.1009250
Tune a PCR model using a simple 10-fold cross validation resampling:
set.seed(100)
pcrTuned <- train(y ~ ., data = train_final_2,
method = "pcr",
tuneLength = 30,
trControl = trainControl(method = "cv",number = 10),
preProc = c("center","scale"))
pcrTuned## Principal Component Analysis
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.1718646 0.01762451 0.1371454
## 2 0.1586037 0.16720249 0.1258347
## 3 0.1570218 0.18294287 0.1244416
## 4 0.1570420 0.18290136 0.1241536
## 5 0.1571163 0.18188676 0.1241855
## 6 0.1512348 0.24234880 0.1191913
## 7 0.1478717 0.27454935 0.1169642
## 8 0.1479857 0.27353826 0.1170316
## 9 0.1462607 0.29116721 0.1158732
## 10 0.1460452 0.29339826 0.1158269
## 11 0.1452371 0.30077633 0.1150145
## 12 0.1451628 0.30149007 0.1150166
## 13 0.1450040 0.30320170 0.1149718
## 14 0.1446501 0.30701395 0.1147964
## 15 0.1435616 0.31784943 0.1141008
## 16 0.1438172 0.31606113 0.1141846
## 17 0.1428033 0.32530659 0.1131455
## 18 0.1417867 0.33345382 0.1117694
## 19 0.1420172 0.33123903 0.1119606
## 20 0.1400813 0.34812466 0.1107354
## 21 0.1390265 0.35843093 0.1095613
## 22 0.1390869 0.35802389 0.1096102
## 23 0.1391371 0.35798081 0.1098080
## 24 0.1381719 0.36828517 0.1087031
## 25 0.1379988 0.36968739 0.1084605
## 26 0.1369608 0.37894519 0.1072102
## 27 0.1370123 0.37889309 0.1072978
## 28 0.1370640 0.37843593 0.1073096
## 29 0.1345740 0.40114978 0.1043854
## 30 0.1344895 0.40165922 0.1044097
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 30.
Make prediction on test data and evaluate PCR model performance:
pcr_pred <- predict(pcrTuned, newdata= subset(dev_test_2, select = -y))
pcr_accuracy<-postResample(pred = pcr_pred, obs = dev_test_2$y)
pcr_accuracy## RMSE Rsquared MAE
## 0.1292489 0.4220536 0.1014411
Tune Neural Network model:
# Get highly correlated columns and remove them from training and test data
highly_correlated <- findCorrelation(cor(train_final_2),cutoff = .75)
train_Xnet <- train_final_2[,-highly_correlated]
test_Xnet <- dev_test_2[,-highly_correlated]
# Create candidate set of models to evaluate
nnetGrid <- expand.grid(.decay = c(0,0.01,0.1),.size = c(1:5),.bag = FALSE)
set.seed(100)
nnetTuned <- train(y~.,data = train_Xnet,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv",number = 10),
preProc = c("center","scale"),
linout = TRUE,trace = FALSE,MaxNWts = 5 * (ncol(train_Xnet)+ 1) + 5 + 1,maxit = 250)## Warning: executing %dopar% sequentially: no parallel backend registered
## Model Averaged Neural Network
##
## 1997 samples
## 25 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.1466581 0.3035138 0.11577404
## 0.00 2 0.2005945 0.2863020 0.11812740
## 0.00 3 0.1362004 0.3938649 0.10400028
## 0.00 4 0.1291882 0.4448982 0.09858460
## 0.00 5 0.1464837 0.4334126 0.09920246
## 0.01 1 0.1422052 0.3288780 0.11265563
## 0.01 2 0.1339945 0.4046356 0.10502836
## 0.01 3 0.1285895 0.4505014 0.09909846
## 0.01 4 0.1245089 0.4851793 0.09567046
## 0.01 5 0.1247789 0.4830863 0.09520996
## 0.10 1 0.1429982 0.3212229 0.11350535
## 0.10 2 0.1344134 0.4006948 0.10587831
## 0.10 3 0.1286357 0.4512212 0.09959212
## 0.10 4 0.1274502 0.4612314 0.09776311
## 0.10 5 0.1254353 0.4779160 0.09602596
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.
Make prediction on test data and evaluate Neural Network model performance:
nnet_pred <- predict(nnetTuned, newdata= subset(test_Xnet, select = -y))
nnet_accuracy<-postResample(pred = nnet_pred, obs = test_Xnet$y)
nnet_accuracy## RMSE Rsquared MAE
## 0.11698486 0.52905319 0.09214485
Tune MARS model using default resampling:
set.seed(100)
marsgrid <- expand.grid(degree = 1:3,nprune = 2:30)
marsTuned<-train(y ~ ., data = train_final_2,
method = "earth",
tuneGrid = marsgrid,
trControl = trainControl(method = "cv",number = 10),
preProc = c("center","scale"))## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Multivariate Adaptive Regression Spline
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.1529547 0.2239762 0.11997116
## 1 3 0.1451487 0.3016040 0.11384292
## 1 4 0.1432637 0.3202380 0.11184079
## 1 5 0.1416102 0.3360887 0.11023236
## 1 6 0.1401369 0.3503129 0.10836513
## 1 7 0.1379969 0.3701586 0.10678623
## 1 8 0.1364008 0.3851537 0.10559972
## 1 9 0.1356149 0.3920051 0.10461549
## 1 10 0.1329486 0.4153091 0.10325588
## 1 11 0.1327544 0.4164348 0.10228196
## 1 12 0.1326903 0.4171781 0.10221403
## 1 13 0.1323324 0.4202533 0.10201317
## 1 14 0.1320768 0.4227396 0.10170424
## 1 15 0.1316303 0.4268054 0.10116537
## 1 16 0.1318719 0.4251477 0.10127711
## 1 17 0.1316430 0.4274757 0.10110835
## 1 18 0.1315941 0.4280534 0.10091550
## 1 19 0.1315941 0.4280534 0.10091550
## 1 20 0.1315941 0.4280534 0.10091550
## 1 21 0.1315941 0.4280534 0.10091550
## 1 22 0.1315941 0.4280534 0.10091550
## 1 23 0.1315941 0.4280534 0.10091550
## 1 24 0.1315941 0.4280534 0.10091550
## 1 25 0.1315941 0.4280534 0.10091550
## 1 26 0.1315941 0.4280534 0.10091550
## 1 27 0.1315941 0.4280534 0.10091550
## 1 28 0.1315941 0.4280534 0.10091550
## 1 29 0.1315941 0.4280534 0.10091550
## 1 30 0.1315941 0.4280534 0.10091550
## 2 2 0.1534183 0.2195273 0.12039606
## 2 3 0.1466813 0.2856167 0.11535622
## 2 4 0.1456379 0.2973107 0.11408554
## 2 5 0.1425876 0.3267202 0.11109689
## 2 6 0.1406668 0.3472048 0.10882931
## 2 7 0.1414177 0.3438076 0.10938939
## 2 8 0.1393029 0.3625760 0.10679149
## 2 9 0.1382944 0.3710933 0.10563762
## 2 10 0.1376960 0.3771047 0.10543621
## 2 11 0.1370275 0.3839179 0.10464815
## 2 12 0.1357681 0.3951972 0.10336700
## 2 13 0.1354655 0.3991030 0.10275233
## 2 14 0.1339245 0.4119980 0.10165412
## 2 15 0.1316917 0.4305050 0.09989991
## 2 16 0.1307698 0.4401823 0.09879455
## 2 17 0.1292417 0.4529989 0.09790942
## 2 18 0.1288209 0.4561648 0.09759841
## 2 19 0.1286744 0.4575672 0.09726496
## 2 20 0.1288687 0.4568837 0.09739498
## 2 21 0.1281908 0.4625339 0.09667287
## 2 22 0.1280102 0.4638934 0.09646312
## 2 23 0.1274060 0.4685275 0.09621850
## 2 24 0.1266720 0.4739322 0.09578023
## 2 25 0.1263278 0.4763104 0.09547193
## 2 26 0.1261539 0.4777784 0.09540930
## 2 27 0.1262246 0.4772829 0.09536583
## 2 28 0.1261877 0.4777085 0.09539824
## 2 29 0.1260255 0.4788105 0.09530204
## 2 30 0.1257873 0.4806389 0.09513345
## 3 2 0.1529547 0.2239762 0.11997116
## 3 3 0.1456463 0.2963680 0.11424491
## 3 4 0.1427016 0.3237116 0.11098700
## 3 5 0.1405485 0.3450519 0.10917268
## 3 6 0.1384343 0.3644972 0.10710921
## 3 7 0.1373972 0.3728565 0.10557776
## 3 8 0.1359429 0.3864454 0.10435467
## 3 9 0.1349495 0.3958971 0.10369222
## 3 10 0.1338498 0.4057464 0.10263162
## 3 11 0.1324577 0.4185147 0.10118773
## 3 12 0.1317300 0.4257988 0.10097386
## 3 13 0.1310221 0.4327902 0.09990873
## 3 14 0.1284114 0.4534883 0.09804658
## 3 15 0.1292054 0.4487715 0.09818156
## 3 16 0.1269489 0.4667162 0.09681454
## 3 17 0.1262097 0.4725133 0.09576426
## 3 18 0.1250709 0.4816390 0.09470195
## 3 19 0.1245805 0.4860489 0.09405047
## 3 20 0.1247285 0.4850913 0.09405229
## 3 21 0.1249612 0.4841608 0.09409856
## 3 22 0.1249451 0.4842656 0.09411946
## 3 23 0.1248127 0.4855871 0.09388316
## 3 24 0.1246949 0.4869685 0.09362219
## 3 25 0.1250028 0.4864160 0.09371541
## 3 26 0.1249981 0.4869485 0.09386094
## 3 27 0.1245131 0.4908756 0.09351394
## 3 28 0.1243361 0.4927872 0.09355877
## 3 29 0.1238487 0.4967387 0.09299816
## 3 30 0.1234745 0.4997570 0.09257993
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 30 and degree = 3.
Make prediction on test data and evaluate MARS model performance:
mars_pred <- predict(marsTuned, newdata= subset(dev_test_2, select = -y))
mars_accuracy<-postResample(pred = mars_pred, obs = dev_test_2$y)
mars_accuracy## RMSE Rsquared MAE
## 0.11718086 0.52620674 0.09066281
Tune SVM model:
set.seed(100)
svmTuned<-train(x = subset(train_final_2, select = -y),
y = train_final_2$y ,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmTuned## Support Vector Machines with Radial Basis Function Kernel
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.1268100 0.4758660 0.09398379
## 0.50 0.1235019 0.4997336 0.09073704
## 1.00 0.1207803 0.5197304 0.08845277
## 2.00 0.1184667 0.5364172 0.08683304
## 4.00 0.1164352 0.5521370 0.08520787
## 8.00 0.1169365 0.5508894 0.08588098
## 16.00 0.1181413 0.5470169 0.08738437
## 32.00 0.1216116 0.5299452 0.09007273
## 64.00 0.1283433 0.4950676 0.09495553
## 128.00 0.1356823 0.4625062 0.10017520
## 256.00 0.1414263 0.4410341 0.10440948
## 512.00 0.1465595 0.4188510 0.10827227
## 1024.00 0.1489228 0.4092491 0.11028641
## 2048.00 0.1489228 0.4092491 0.11028641
##
## Tuning parameter 'sigma' was held constant at a value of 0.01971348
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01971348 and C = 4.
Make prediction on test data and evaluate SVM model performance:
svm_pred <- predict(svmTuned, newdata= subset(dev_test_2, select = -y))
svm_accuracy<-postResample(pred = svm_pred, obs = dev_test_2$y)
svm_accuracy## RMSE Rsquared MAE
## 0.10956812 0.58456969 0.08225389
Tune Recursive Partitioning Decision Tree Model:
set.seed(100)
rpartTune <- train(x = subset(train_final_2, select = -y),
y = train_final_2$y ,
method = "rpart2",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
rpartTune## CART
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1798, 1796, 1798, 1796, 1797, 1798, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 0.1529747 0.2237745 0.1199919
## 2 0.1472369 0.2813990 0.1157326
## 3 0.1431181 0.3214345 0.1133946
## 4 0.1397949 0.3514484 0.1095457
## 5 0.1389555 0.3587993 0.1086642
## 6 0.1386227 0.3616622 0.1081520
## 7 0.1378278 0.3694417 0.1069444
## 8 0.1371662 0.3758798 0.1065626
## 10 0.1356283 0.3910793 0.1052888
## 12 0.1298444 0.4414623 0.1004344
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 12.
Make prediction on test set and evaluate performance of Recursive Partitioning Decision Tree model:
rpart_pred <- predict(rpartTune, newdata= subset(dev_test_2, select = -y))
rpart_accuracy<-postResample(pred=rpart_pred, obs = dev_test_2$y)
rpart_accuracy## RMSE Rsquared MAE
## 0.12291123 0.47715548 0.09575514
Tune Bagged Trees model:
set.seed(100)
baggedTree <- ipredbagg(train_final_2$y,subset(train_final_2, select = -y))
bagged_Pred <- predict(baggedTree, newdata= subset(dev_test_2, select = -y))
baggedTree_accuracy<-postResample(pred=rpart_pred, obs = dev_test_2$y)
baggedTree_accuracy## RMSE Rsquared MAE
## 0.12291123 0.47715548 0.09575514
Tune Cubist Model:
cubistGrid <- expand.grid(committees = c(1, 5, 10),
neighbors = c(0, 1, 3, 5))
set.seed(100)
cubistTune <- train(x = subset(train_final_2, select = -y),
y = train_final_2$y ,
method = "cubist",
preProc = c("center", "scale"),
tuneGrid = cubistGrid,
verbose = FALSE)
cubistTune## Cubist
##
## 1997 samples
## 35 predictor
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1997, 1997, 1997, 1997, 1997, 1997, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1551279 0.3711975 0.10266237
## 1 1 0.1611767 0.3879031 0.10643571
## 1 3 0.1555072 0.3966772 0.10231619
## 1 5 0.1541102 0.3972435 0.10129013
## 5 0 0.1184657 0.5386691 0.08368411
## 5 1 0.1215098 0.5423594 0.08496936
## 5 3 0.1174746 0.5575404 0.08236944
## 5 5 0.1167658 0.5591217 0.08202837
## 10 0 0.1104652 0.5905058 0.07850673
## 10 1 0.1131453 0.5874095 0.07910953
## 10 3 0.1093231 0.6050411 0.07683539
## 10 5 0.1085769 0.6080797 0.07656626
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 5.
Make prediction on test set and evaluate performance of Cubist model:
cubist_pred <- predict(cubistTune, newdata= subset(dev_test_2, select = -y))
cubist_accuracy<-postResample(pred=cubist_pred, obs = dev_test_2$y)
cubist_accuracy## RMSE Rsquared MAE
## 0.08878050 0.72764140 0.06655013
# Combine results of all models into a single table
all_results<- rbind(
knn=knn_accuracy,
rf=rf_accuracy,
gbm=gbm_accuracy,
pls=pls_accuracy,
pcr=pcr_accuracy,
nnet=nnet_accuracy,
mars=mars_accuracy,
svm=svm_accuracy,
rpart = rpart_accuracy,
baggedTree=baggedTree_accuracy,
cubist= cubist_accuracy
)
# Convert to a data frame
results <- as.data.frame(all_results)
# See results
print(results)## RMSE Rsquared MAE
## knn 0.11294418 0.5621305 0.08753535
## rf 0.09058638 0.7340357 0.06782500
## gbm 0.10245067 0.6440754 0.07834692
## pls 0.12875018 0.4265420 0.10092503
## pcr 0.12924892 0.4220536 0.10144108
## nnet 0.11698486 0.5290532 0.09214485
## mars 0.11718086 0.5262067 0.09066281
## svm 0.10956812 0.5845697 0.08225389
## rpart 0.12291123 0.4771555 0.09575514
## baggedTree 0.12291123 0.4771555 0.09575514
## cubist 0.08878050 0.7276414 0.06655013
The Cubist model has a slightly lower RMSE (0.0888) than the Random Forest model (0.0906).This shows that the Cubist model has a slightly better prediction accuracy. However, the Random Forest model has a higher R-squared value (0.7340 vs. 0.7276). This indicates that the Random Forest model explains more variance in the data.
Considering the minor differences in these metrics, Random Forest is a better choice because it provides a better balance between prediction accuracy and explanatory power.
imp_scores <- varImp(random_forest_model)
imp_scores |>
mutate (var = rownames(imp_scores)) |>
ggplot(aes(Overall, reorder(var,Overall, sum), var)) +
geom_col(fill = 'blue') +
labs(title = 'Predictor Importance' , y = 'Predictors')imp_scores <- varImp(random_forest_model)
top10 <- head(imp_scores[order(-imp_scores$Overall), , drop = FALSE], 10)
top10## Overall
## Mnf_Flow 7.378791
## Usage_cont 4.399368
## code_c 2.878830
## Filler_Level 2.723815
## Oxygen_Filler 2.706484
## Bowl_Setpoint 2.613890
## Temperature 2.600523
## Carb_Rel 2.241133
## Pressure_Vacuum 2.158924
## Balling_Lvl 2.125748
# Get correlation matrix
corr_matrix <- cor(select(train, c("PH", row.names(top10))), use = "complete.obs")
corr_matrix## PH Mnf_Flow Usage_cont code_c Filler_Level
## PH 1.0000000 -0.447449397 -0.313766476 -0.295535701 0.32653555
## Mnf_Flow -0.4474494 1.000000000 0.519156997 -0.001914574 -0.57876422
## Usage_cont -0.3137665 0.519156997 1.000000000 0.008684086 -0.35034791
## code_c -0.2955357 -0.001914574 0.008684086 1.000000000 0.05317548
## Filler_Level 0.3265356 -0.578764217 -0.350347913 0.053175479 1.00000000
## Oxygen_Filler 0.1637180 -0.536994185 -0.306667782 0.037792339 0.22825209
## Bowl_Setpoint 0.3492594 -0.580000493 -0.364109851 0.052668762 0.94948508
## Temperature -0.1820321 -0.071607350 -0.102012924 0.198846869 0.06534243
## Carb_Rel 0.1669034 -0.029104955 -0.030272353 -0.240792365 0.10562619
## Pressure_Vacuum 0.2199029 -0.529363084 -0.300989266 -0.060884498 0.33204757
## Balling_Lvl 0.1098807 0.031960803 0.037975706 -0.221198022 0.04619991
## Oxygen_Filler Bowl_Setpoint Temperature Carb_Rel
## PH 0.16371802 0.34925942 -0.18203206 0.166903383
## Mnf_Flow -0.53699419 -0.58000049 -0.07160735 -0.029104955
## Usage_cont -0.30666778 -0.36410985 -0.10201292 -0.030272353
## code_c 0.03779234 0.05266876 0.19884687 -0.240792365
## Filler_Level 0.22825209 0.94948508 0.06534243 0.105626192
## Oxygen_Filler 1.00000000 0.23105759 0.13322643 -0.010143824
## Bowl_Setpoint 0.23105759 1.00000000 0.04208445 0.127269033
## Temperature 0.13322643 0.04208445 1.00000000 -0.106773288
## Carb_Rel -0.01014382 0.12726903 -0.10677329 1.000000000
## Pressure_Vacuum 0.22875128 0.34540601 0.03572839 -0.004106525
## Balling_Lvl -0.05128338 0.06468899 -0.17939526 0.850528168
## Pressure_Vacuum Balling_Lvl
## PH 0.219902868 0.10988070
## Mnf_Flow -0.529363084 0.03196080
## Usage_cont -0.300989266 0.03797571
## code_c -0.060884498 -0.22119802
## Filler_Level 0.332047572 0.04619991
## Oxygen_Filler 0.228751276 -0.05128338
## Bowl_Setpoint 0.345406013 0.06468899
## Temperature 0.035728385 -0.17939526
## Carb_Rel -0.004106525 0.85052817
## Pressure_Vacuum 1.000000000 -0.04022344
## Balling_Lvl -0.040223438 1.00000000
# Plot correlation matrix
corrplot(corr_matrix, method = "circle",mar = c(0,0,3, 0))
title ("Correlation between pH and top 10 predictors")Below we will examine the Random Forest models to see the best model to predict the PH values of the dataset ‘test_data’, where we have missing values for PH that we need to predict. We will need to preprocess the test dataset to match the structure of the training dataset, impute the missing values, and uses a trained Random Forest model to predict PH values for the test data. The ‘PH’ column, which is not present in the training data, is temporarily removed from the test dataset. The trained Random Forest model is then applied to predict PH values, and these predictions are appended to the dataset as a new column. Finally, the processed test dataset, including the predicted PH values, is saved to a CSV file
test_1 <- test
#Match the column names with training data
colnames(test_1) <- sub(" ", "_", colnames(test_1))
# Preprocess the test dataset to match the training dataset format
test_1 <- test_1 %>%
mutate(
code_a = ifelse(Brand_Code == "A", 1, 0),
code_b = ifelse(Brand_Code == "B", 1, 0),
code_c = ifelse(Brand_Code == "C", 1, 0),
code_d = ifelse(Brand_Code == "D", 1, 0)
) %>%
mutate(
code_a = replace_na(code_a, 0),
code_b = replace_na(code_b, 0),
code_c = replace_na(code_c, 0),
code_d = replace_na(code_d, 0)
) %>%
select(-Brand_Code)
test_1 <- as.data.frame(test_1)
# Impute missing values in the test dataset using Mean Imputation
test_imputed <- test_1 %>%
mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
#take PH away to match training data
test_imputed <- test_imputed %>% select(-PH)
# Predict PH values using the trained random forest model
test_imputed$PH <- predict(random_forest_model, newdata = test_imputed)
# View the predicted PH values
head(test_imputed$PH)## [1] 8.558511 8.561617 8.573331 8.573890 8.575387 8.574257
From here we can impute the new PH value generated into the original ‘test’ dataset. and export it as an excel file.
test_reverted <- test %>%
mutate(PH = test_imputed$PH) # Remove the temporary prediction column
# View the reverted dataset
tail(test_reverted)## # A tibble: 6 × 33
## Brand_Code Carb_Volume Fill_Ounces PC_Volume Carb_Pressure Carb_Temp PSC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 D 5.51 24.1 0.193 77 149. 0.096
## 2 B 5.39 24.0 0.223 66.4 138. 0.096
## 3 D 5.57 24.1 0.234 62.8 130 0.054
## 4 C 5.42 24.0 0.273 62.6 133. 0.128
## 5 C 5.25 24.1 0.361 68.2 145. 0.06
## 6 C 5.22 24.0 0.225 64.4 140 0.126
## # ℹ 26 more variables: PSC_Fill <dbl>, PSC_CO2 <dbl>, Mnf_Flow <dbl>,
## # Carb_Pressure1 <dbl>, Fill_Pressure <dbl>, Hyd_Pressure1 <dbl>,
## # Hyd_Pressure2 <dbl>, Hyd_Pressure3 <dbl>, Hyd_Pressure4 <dbl>,
## # Filler_Level <dbl>, Filler_Speed <dbl>, Temperature <dbl>,
## # Usage_cont <dbl>, Carb_Flow <dbl>, Density <dbl>, MFR <dbl>, Balling <dbl>,
## # Pressure_Vacuum <dbl>, PH <dbl>, Oxygen_Filler <dbl>, Bowl_Setpoint <dbl>,
## # Pressure_Setpoint <dbl>, Air_Pressurer <dbl>, Alch_Rel <dbl>, …
to see how accurate the model preform, we can compare a predicted PH values, with our known PH values from train data.
# Combine predictions from training and testing for comparison
train_final_2$Prediction <- predict(random_forest_model, newdata = train_final_2)
# Enhanced plot for observed vs predicted for training data
ggplot(data = train_final_2, aes(x = y, y = Prediction)) +
geom_point(color = "blue", alpha = 0.3, size = 1.5) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1) +
labs(title = "Observed vs Predicted (Training Data)",
x = "Observed PH",
y = "Predicted PH") +
theme_minimal() +
theme(panel.grid.major = element_line(color = "gray", linetype = "dotted"),
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 12))## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The plot of “Observed vs Predicted (Training Data)” shows a strong
positive correlation, as most points lie close to the diagonal red
dashed line, which represents a perfect prediction. This alignment
suggests that the Random Forest model has effectively captured the
relationship between the features and the target variable PH in the
training dataset. However, slight deviations are observed, particularly
at the extremes of the range, indicating minor inaccuracies in those
regions. The clustering of points around specific PH values also reveals
that the training data may have more examples in those ranges, resulting
in better predictions for those areas.