Load the Training and Evaluation Datasets

Both datasets share the same 33-variable structure. The training set contains 2,571 observations and the evaluation set 267; in the evaluation set, PH is entirely missing by design, since it is the target variable to be predicted.

# Load training and test datasets with explicit missing value handling
train_set <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-624/refs/heads/main/Project%202/StudentData.csv", na.strings = c("", "NA", "NULL"))
evaluation_set <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-624/refs/heads/main/Project%202/StudentEvaluation.csv", na.strings = c("", "NA", "NULL"))

# Verify missing values in Brand.Code
cat("Number of missing values in Brand.Code (Training Dataset):", sum(is.na(train_set$Brand.Code)), "\n")
## Number of missing values in Brand.Code (Training Dataset): 120
cat("Number of missing values in Brand.Code (Evaluation Dataset):", sum(is.na(evaluation_set$Brand.Code)), "\n")
## Number of missing values in Brand.Code (Evaluation Dataset): 8
# Check the structure of the dataset to confirm the changes
str(train_set)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand.Code       : chr  "B" "A" "B" "A" ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  24 24 24.1 24 24.3 ...
##  $ PC.Volume        : num  0.263 0.239 0.263 0.293 0.111 ...
##  $ Carb.Pressure    : num  68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
##  $ Carb.Temp        : num  141 140 145 133 137 ...
##  $ PSC              : num  0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  119 122 120 115 118 ...
##  $ Fill.Pressure    : num  46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA NA NA 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  118 106 82 92 92 116 124 132 90 108 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : int  2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
##  $ Density          : num  0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  1.4 1.5 3.14 3.04 3.04 ...
##  $ Pressure.Vacuum  : num  -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
cat("\n\n")
# Check the structure of the dataset to confirm the changes
str(evaluation_set)
## 'data.frame':    267 obs. of  33 variables:
##  $ Brand.Code       : chr  "D" "A" "B" "B" ...
##  $ Carb.Volume      : num  5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill.Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC.Volume        : num  0.27 0.227 0.303 0.186 0.16 ...
##  $ Carb.Pressure    : num  65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
##  $ Carb.Temp        : num  135 135 140 139 142 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC.Fill         : num  0.4 0.22 0.1 0.2 0.3 0.22 0.14 0.1 0.48 0.1 ...
##  $ PSC.CO2          : num  0.04 0.08 0.02 0.02 0.06 NA 0 0.04 0.04 0.02 ...
##  $ Mnf.Flow         : num  -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
##  $ Carb.Pressure1   : num  117 119 120 125 115 ...
##  $ Fill.Pressure    : num  46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : int  96 112 98 132 94 94 108 108 110 106 ...
##  $ Filler.Level     : num  129 120 119 120 116 ...
##  $ Filler.Speed     : int  3986 4012 4010 NA 4018 4010 4010 NA 4010 1006 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
##  $ Usage.cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb.Flow        : int  2950 2916 3056 28 3214 3064 3042 1972 2502 28 ...
##  $ Density          : num  0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
##  $ MFR              : num  728 736 735 NA 752 ...
##  $ Balling          : num  1.4 2.94 1.45 1.06 1.4 ...
##  $ Pressure.Vacuum  : num  -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ Oxygen.Filler    : num  0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl.Setpoint    : int  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  45.2 46 46 46 50 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 147 147 146 146 ...
##  $ Alch.Rel         : num  6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb.Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling.Lvl      : num  1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...

Summarize numerical variables and investigate missing values

Notable Observations:

  1. Mnf.Flow: Highly skewed with a wide range (-100.20 to 229.40) and a very high standard deviation (119.48), indicating significant variability.
  2. Carb.Flow: Extremely skewed with a wide range (26.00 to 5104.00) and a high standard deviation (1073.70), suggesting potential outliers.
  3. Filler.Speed: Large spread (998.00 to 4030.00) with a substantial standard deviation (770.82), indicating significant variability across samples.
  4. Hyd.Pressure1, Hyd.Pressure2, Hyd.Pressure3: All exhibit high skewness with notable ranges and standard deviations (e.g., Hyd.Pressure1: -0.80 to 58.00, SD = 12.43), likely requiring transformation.
  5. MFR: Missing values are notable (8.25%), with a moderately wide range (31.40 to 868.60) and standard deviation (73.90), indicating some variability.

Recommendations:

  • Apply transformations (e.g., log or Box-Cox) to skewed variables (Carb.Flow, Mnf.Flow, Hyd.Pressure*) to normalize their distributions; a brief sketch follows this list.
  • Investigate outliers for variables with large ranges and high variability (Carb.Flow, Filler.Speed).
  • Address missingness in MFR due to its notable percentage (8.25%).
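
To illustrate the first recommendation, the minimal sketch below compares skewness before and after a candidate log transform (assuming the e1071 package is available; Carb.Flow is strictly positive, so a log transform is valid here, while variables containing non-positive values such as Mnf.Flow are handled later with the Yeo-Johnson transformation).

library(e1071)  # for the skewness() function
skewness(train_set$Carb.Flow, na.rm = TRUE)       # quantify skewness on the original scale
skewness(log(train_set$Carb.Flow), na.rm = TRUE)  # re-check skewness after a log transform
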
# Load necessary libraries
# install.packages('kableExtra', repos='http://cran.rstudio.com/')

library(kableExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)


# Create vectors of numeric and categorical variables for insurance_training
numeric_vars <- names(train_set)[sapply(train_set, is.numeric)]
categorical_vars <- names(train_set)[sapply(train_set, is.factor)]

# Correctly select numerical variables using the predefined numeric_vars
numerical_vars <- train_set %>%
  dplyr::select(all_of(numeric_vars))


# Compute statistical summary including missing value counts and percentages
statistical_summary <- numerical_vars %>%
  summarise(across(
    everything(),
    list(
      Min = ~round(min(., na.rm = TRUE), 2),
      Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
      Mean = ~round(mean(., na.rm = TRUE), 2),
      Median = ~round(median(., na.rm = TRUE), 2),
      Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
      Max = ~round(max(., na.rm = TRUE), 2),
      SD = ~round(sd(., na.rm = TRUE), 2),
      Missing = ~sum(is.na(.)), # Count of missing values
      PercentMissing = ~round(mean(is.na(.)) * 100, 2) # Percentage of missing values
    ),
    .names = "{.col}_{.fn}"
  )) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("Variable", ".value"),
    names_pattern = "^(.*)_(.*)$"
  )

# Display the resulting summary table
statistical_summary %>%
  kable(caption = "Summary of Numerical Variables (Including Missing Counts and Percentages)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Summary of Numerical Variables (Including Missing Counts and Percentages)
Variable Min Q1 Mean Median Q3 Max SD Missing PercentMissing
Carb.Volume 5.04 5.29 5.37 5.35 5.45 5.70 0.11 10 0.39
Fill.Ounces 23.63 23.92 23.97 23.97 24.03 24.32 0.09 38 1.48
PC.Volume 0.08 0.24 0.28 0.27 0.31 0.48 0.06 39 1.52
Carb.Pressure 57.00 65.60 68.19 68.20 70.60 79.40 3.54 27 1.05
Carb.Temp 128.60 138.40 141.09 140.80 143.80 154.00 4.04 26 1.01
PSC 0.00 0.05 0.08 0.08 0.11 0.27 0.05 33 1.28
PSC.Fill 0.00 0.10 0.20 0.18 0.26 0.62 0.12 23 0.89
PSC.CO2 0.00 0.02 0.06 0.04 0.08 0.24 0.04 39 1.52
Mnf.Flow -100.20 -100.00 24.57 65.20 140.80 229.40 119.48 2 0.08
Carb.Pressure1 105.60 119.00 122.59 123.20 125.40 140.20 4.74 32 1.24
Fill.Pressure 34.60 46.00 47.92 46.40 50.00 60.40 3.18 22 0.86
Hyd.Pressure1 -0.80 0.00 12.44 11.40 20.20 58.00 12.43 11 0.43
Hyd.Pressure2 0.00 0.00 20.96 28.60 34.60 59.40 16.39 15 0.58
Hyd.Pressure3 -1.20 0.00 20.46 27.60 33.40 50.00 15.98 15 0.58
Hyd.Pressure4 52.00 86.00 96.29 96.00 102.00 142.00 13.12 30 1.17
Filler.Level 55.80 98.30 109.25 118.40 120.00 161.20 15.70 20 0.78
Filler.Speed 998.00 3888.00 3687.20 3982.00 3998.00 4030.00 770.82 57 2.22
Temperature 63.60 65.20 65.97 65.60 66.40 76.20 1.38 14 0.54
Usage.cont 12.08 18.36 20.99 21.79 23.75 25.90 2.98 5 0.19
Carb.Flow 26.00 1144.00 2468.35 3028.00 3186.00 5104.00 1073.70 2 0.08
Density 0.24 0.90 1.17 0.98 1.62 1.92 0.38 1 0.04
MFR 31.40 706.30 704.05 724.00 731.00 868.60 73.90 212 8.25
Balling -0.17 1.50 2.20 1.65 3.29 4.01 0.93 1 0.04
Pressure.Vacuum -6.60 -5.60 -5.22 -5.40 -5.00 -3.60 0.57 0 0.00
PH 7.88 8.44 8.55 8.54 8.68 9.36 0.17 4 0.16
Oxygen.Filler 0.00 0.02 0.05 0.03 0.06 0.40 0.05 12 0.47
Bowl.Setpoint 70.00 100.00 109.33 120.00 120.00 140.00 15.30 2 0.08
Pressure.Setpoint 44.00 46.00 47.62 46.00 50.00 52.00 2.04 12 0.47
Air.Pressurer 140.80 142.20 142.83 142.60 143.00 148.20 1.21 0 0.00
Alch.Rel 5.28 6.54 6.90 6.56 7.24 8.62 0.51 9 0.35
Carb.Rel 4.96 5.34 5.44 5.40 5.54 6.06 0.13 10 0.39
Balling.Lvl 0.00 1.38 2.05 1.48 3.14 3.66 0.87 1 0.04

Visualize missing values

  1. High Missingness (e.g., MFR ~8%): Use predictive imputation methods like MICE (Multivariate Imputation by Chained Equations) or KNN to estimate values based on patterns in other variables.

  2. Low to Moderate Missingness (<5%):

    • Continuous Variables: Use mean imputation for normally distributed data and median imputation for skewed data (e.g., PC.Volume, Fill.Ounces).
    • Categorical Variables: Use mode imputation or a simple model-based approach (e.g., Brand.Code, which is later imputed via multinomial logistic regression on PH in the training set and by mode in the evaluation set).
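
As a quick illustration of the continuous-variable rule of thumb above (a minimal sketch only; the imputation actually used later in this report is PMM via the mice package):

# Hypothetical sketch: median imputation of a skewed continuous variable
pc_volume_median <- median(train_set$PC.Volume, na.rm = TRUE)
demo_set <- train_set  # work on a copy so the original data stays untouched
demo_set$PC.Volume[is.na(demo_set$PC.Volume)] <- pc_volume_median
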
library(ggplot2)
library(dplyr)

# Prepare data for missing values
missing_data <- train_set %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing") %>%
  mutate(PercentMissing = (Missing / nrow(train_set)) * 100) %>%
  filter(Missing > 0) # Include only variables with missing values

# Create the flipped bar chart
ggplot(missing_data, aes(x = reorder(Variable, PercentMissing), y = PercentMissing)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Missing Values by Variable",
    x = "Variable",
    y = "Percentage of Missing Values (%)"
  ) +
  theme_minimal()

Summary of categorical variables

The Brand.Code variable has four observed brand levels (A, B, C, D) and 120 missing values (4.67% of records); the summary table below counts the missing entries as a fifth level. The mode is B (48.19%), and the variable shows moderate entropy (1.93) with a high imbalance ratio (10.32), indicating a skewed distribution across brands.
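
For reference, the entropy reported here is Shannon entropy over the level proportions, \(H = -\sum_i p_i \log_2 p_i\) (with a maximum of \(\log_2 5 \approx 2.32\) for five equally frequent levels), and the imbalance ratio is the count of the most frequent level divided by the count of the least frequent level.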

library(dplyr)
library(knitr)

# Select categorical columns including Brand.Code
categorical_columns <- train_set %>%
  dplyr::select(all_of(c(categorical_vars, "Brand.Code")))

# Function to calculate Shannon entropy
calculate_entropy <- function(counts) {
  proportions <- counts / sum(counts)
  entropy <- -sum(proportions * log2(proportions), na.rm = TRUE)
  return(entropy)
}

# Function to calculate imbalance ratio
calculate_imbalance_ratio <- function(counts) {
  max_count <- max(counts, na.rm = TRUE)
  min_count <- min(counts, na.rm = TRUE)
  if (min_count == 0) {
    imbalance_ratio <- Inf  # Avoid division by zero
  } else {
    imbalance_ratio <- max_count / min_count
  }
  return(imbalance_ratio)
}

# Ensure all levels for each variable are included
complete_levels <- function(var, data) {
  unique_levels <- unique(data[[var]])
  factor(unique_levels, levels = unique_levels)
}

# Compute the summary for each categorical variable
categorical_summary <- lapply(names(categorical_columns), function(var) {
  # Ensure all levels are accounted for, even with 0 counts
  summary_df <- train_set %>%
    count(!!sym(var), .drop = FALSE) %>%
    complete(!!sym(var) := unique(train_set[[var]]), fill = list(n = 0)) %>%
    mutate(Percentage = round(n / sum(n) * 100, 2)) %>%
    rename(Level = !!sym(var), Count = n) %>%
    mutate(Variable = var)  # Add the variable name for identification
  
  # Compute the mode for the variable
  mode_row <- summary_df %>%
    filter(Count == max(Count, na.rm = TRUE)) %>%
    slice_head(n = 1) %>%  # Use `slice_head()` to handle ties safely
    pull(Level)
  
  # Compute percentage for the mode
  mode_percentage <- summary_df %>%
    filter(Level == mode_row) %>%
    pull(Percentage) %>%
    first()  # Ensure it works even if there are multiple matches
  
  # Count missing values for the variable
  missing_count <- sum(is.na(train_set[[var]]))
  
  # Count unique levels
  unique_levels_count <- n_distinct(train_set[[var]])
  
  # Compute entropy
  entropy <- calculate_entropy(summary_df$Count)
  
  # Compute imbalance ratio
  imbalance_ratio <- calculate_imbalance_ratio(summary_df$Count)
  
  # Combine into a single row summary for the variable
  final_row <- data.frame(
    Variable = var,
    Mode = as.character(mode_row),  # Ensure Mode is always a character
    Mode_Percentage = mode_percentage,
    Missing_Count = missing_count,
    Unique_Levels = unique_levels_count,
    Entropy = round(entropy, 2),
    Imbalance_Ratio = round(imbalance_ratio, 2),
    stringsAsFactors = FALSE  # Avoid factors unless explicitly needed
  )
  
  return(final_row)
})

# Combine summaries into a single data frame
categorical_summary_df <- bind_rows(categorical_summary)

# Print the resulting summary
categorical_summary_df %>%
  kable(caption = "Summary of Categorical Variables (Including Missing Counts and Percentages)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Summary of Categorical Variables (Including Missing Counts and Percentages)
Variable Mode Mode_Percentage Missing_Count Unique_Levels Entropy Imbalance_Ratio
Brand.Code B 48.19 120 5 1.93 10.32

Investigate relationship between Brand Code and PH

There is a statistically significant relationship between Brand_Code and PH based on the analyses below. We will therefore retain Brand.Code as a predictor variable:

Summary Statistics

The mean, median, and standard deviation of PH vary across the different levels of Brand_Code, indicating differences in central tendency and variability.

ANOVA

The ANOVA results show a significant F-statistic with a p-value less than 0.05 (\(p < 2.2 \times 10^{-16}\)), which suggests that there are statistically significant differences in the mean PH values across the levels of Brand_Code.

Boxplots

The boxplots illustrate visible differences in the distributions of PH for each Brand_Code. These differences reinforce the findings from the summary statistics and ANOVA.

Chi-Square Test (Categorized PH)

When PH is categorized into levels such as “Low,” “Medium,” and “High,” the Chi-Square test also indicates a significant association (\(p < 2.2 \times 10^{-16}\)). However, the warning about the Chi-Square approximation suggests caution in interpretation, potentially due to sparse data in some categories.

Conclusion

The statistical evidence supports a relationship between Brand_Code and PH. These findings could inform further modeling, such as including Brand_Code as a categorical predictor in statistical or machine learning models.

Step 1: Statistical Summary by Group

# Summary of PH by Brand Code
summary_stats <- aggregate(PH ~ Brand.Code, 
                           data = train_set, 
                           FUN = function(x) c(mean = mean(x), 
                                               median = median(x), 
                                               sd = sd(x)))

# Convert the result into a more readable data frame
summary_df <- do.call(data.frame, summary_stats)

# Rename the columns for better understanding
colnames(summary_df) <- c("Brand_Code", "PH_Mean", "PH_Median", "PH_SD")

# Print the summary
summary_df %>%
  kable(caption = "Summary of PH by Brand Code") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Summary of PH by Brand Code
Brand_Code PH_Mean PH_Median PH_SD
A 8.497406 8.52 0.1630744
B 8.566785 8.56 0.1689752
C 8.413684 8.42 0.1771266
D 8.602504 8.62 0.1353614

Step 2: Perform ANOVA

# Perform ANOVA
anova_result <- aov(PH ~ Brand.Code, data = train_set)
summary(anova_result)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Brand.Code     3   8.50  2.8322   108.5 <2e-16 ***
## Residuals   2443  63.76  0.0261                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 124 observations deleted due to missingness

Step 3: Create Boxplots

# Visualize using boxplots
library(ggplot2)
ggplot(train_set, aes(x = Brand.Code, y = PH)) +
  geom_boxplot() +
  labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH") +
  theme_minimal()
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Step 4: Chi-Square Test (If PH is Categorized)

# Convert PH to categorical (if needed)
train_set$PH_cat <- cut(train_set$PH, breaks = 3, labels = c("Low", "Medium", "High"))

# Function to calculate mode
get_mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Impute missing values in PH_cat with the mode
mode_PH_cat <- get_mode(train_set$PH_cat)
train_set$PH_cat[is.na(train_set$PH_cat)] <- mode_PH_cat


# Perform Chi-Square Test between Brand.Code and PH_cat
chisq_test <- chisq.test(table(train_set$Brand.Code, train_set$PH_cat))
## Warning in chisq.test(table(train_set$Brand.Code, train_set$PH_cat)):
## Chi-squared approximation may be incorrect
print(chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  table(train_set$Brand.Code, train_set$PH_cat)
## X-squared = 219.69, df = 6, p-value < 2.2e-16
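
Given the warning about the chi-squared approximation, a Monte Carlo p-value can serve as a robustness check (a minimal sketch using base R's chisq.test; not part of the original analysis):

# Robustness check: approximate the p-value by simulation rather than the chi-squared approximation
chisq_test_mc <- chisq.test(table(train_set$Brand.Code, train_set$PH_cat),
                            simulate.p.value = TRUE, B = 10000)
print(chisq_test_mc)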

Impute Missing Values

The imputation methods applied in the script are outlined below, incorporating the specialized treatment of the Brand_Code variable and handling scenarios where the target variable (PH) is unavailable in the evaluation set:

Imputation Methods Summary:

  1. Numeric Variables:
    • Predictive Mean Matching (PMM) was used for numeric variables, implemented via the mice package.
    • Missing values were imputed by predicting plausible values based on observed data patterns, ensuring realistic and consistent imputations across variables.
  2. Categorical Variables (Brand_Code):
    • Brand_Code:
      • For the training set, missing values in Brand_Code were imputed using a multinomial logistic regression model with PH as the predictor. This method leverages the observed relationship between Brand_Code and PH to provide accurate and contextually appropriate imputations.
      • For the evaluation set (where PH is unavailable), missing values in Brand_Code were imputed using mode-based imputation. The most frequent category from the training data was assigned to ensure consistency while addressing the absence of a predictor.
  3. Exclusions:
    • The variable PH was explicitly excluded from imputation in the evaluation_set as its values are missing by design, representing the target variable to be predicted.
    • Any remaining missing values for PH were excluded from the final missing values report, as they are not subject to imputation.
# Check and install necessary packages
necessary_packages <- c("mice", "dplyr", "nnet")
for (pkg in necessary_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the required libraries
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(dplyr)
library(nnet)

# Function to impute numeric variables using MICE
impute_with_mice <- function(data, exclude_vars = NULL) {
  numeric_data <- data %>% select_if(is.numeric)
  set.seed(123)  # For reproducibility
  imputed_data <- mice(numeric_data, m = 1, method = "pmm", maxit = 5, printFlag = FALSE)
  data[, names(numeric_data)] <- complete(imputed_data)
  return(data)
}

# Function to impute Brand.Code using Multinomial Logistic Regression
impute_brand_code <- function(data, target_col, predictor_col) {
  # Ensure Brand.Code is a factor
  data[[target_col]] <- factor(data[[target_col]])
  
  # Filter data with non-missing values
  train_data_non_missing <- data[!is.na(data[[target_col]]), ]
  
  # Check if sufficient classes are present
  if (length(unique(train_data_non_missing[[target_col]])) > 1) {
    # Train multinomial logistic regression
    model <- multinom(as.formula(paste(target_col, "~", predictor_col)), data = train_data_non_missing)
    
    # Predict missing values
    missing_indices <- is.na(data[[target_col]])
    data[[target_col]][missing_indices] <- predict(model, newdata = data[missing_indices, ])
    
    # Ensure the factor levels are consistent
    data[[target_col]] <- factor(data[[target_col]], levels = levels(train_data_non_missing[[target_col]]))
  } else {
    # Use the mode class if only one class is present
    warning("Only one class detected for imputation. Using mode imputation.")
    mode_class <- names(sort(table(train_data_non_missing[[target_col]]), decreasing = TRUE))[1]
    data[[target_col]][is.na(data[[target_col]])] <- mode_class
  }
  
  return(data)
}

# Function to impute Brand.Code for datasets without PH values
impute_brand_code_no_ph <- function(data, target_col, train_data) {
  # Ensure Brand.Code is a factor
  train_data[[target_col]] <- factor(train_data[[target_col]])
  
  # Use the mode of Brand.Code from the training data
  mode_class <- names(sort(table(train_data[[target_col]]), decreasing = TRUE))[1]
  data[[target_col]][is.na(data[[target_col]])] <- mode_class
  data[[target_col]] <- factor(data[[target_col]], levels = levels(train_data[[target_col]]))
  
  return(data)
}

# Create copies of train_set and evaluation_set
train_set_imputed <- train_set
evaluation_set_imputed <- evaluation_set

# Step 1: Impute numeric variables
train_set_imputed <- impute_with_mice(train_set_imputed)
evaluation_set_imputed <- impute_with_mice(evaluation_set_imputed)

# Step 2: Impute Brand.Code
train_set_imputed <- impute_brand_code(train_set_imputed, target_col = "Brand.Code", predictor_col = "PH")
## # weights:  12 (6 variable)
## initial  value 3397.807479 
## iter  10 value 2806.215098
## final  value 2804.972880 
## converged
# Use mode-based imputation for the evaluation set as it lacks PH values
evaluation_set_imputed <- impute_brand_code_no_ph(evaluation_set_imputed, target_col = "Brand.Code", train_data = train_set_imputed)

# Step 3: Check for any remaining missing values
check_missing <- function(data) {
  missing_summary <- colSums(is.na(data))
  missing_summary <- missing_summary[missing_summary > 0]
  return(missing_summary)
}

# Check missing values
cat("Missing values in train_set_imputed:\n")
## Missing values in train_set_imputed:
print(check_missing(train_set_imputed))
## named numeric(0)
cat("\nMissing values in evaluation_set_imputed:\n")
## 
## Missing values in evaluation_set_imputed:
print(check_missing(evaluation_set_imputed))
##  PH 
## 267
# Structure of imputed datasets
cat("\nStructure of Train Set Imputed:\n")
## 
## Structure of Train Set Imputed:
# Drop PH_cat column from train_set_imputed
if ("PH_cat" %in% colnames(train_set_imputed)) {
  train_set_imputed <- train_set_imputed %>% dplyr::select(-PH_cat)
  cat("PH_cat column has been removed from train_set_imputed.\n")
} else {
  cat("PH_cat column does not exist in train_set_imputed.\n")
}
## PH_cat column has been removed from train_set_imputed.

Visualize Imputed dataset

Data Distribution and Characteristics

The imputed train dataset was analyzed to understand the distributions, skewness, and presence of outliers among its numeric variables. The goal was to identify patterns in the data, such as normality, skewed distributions, and extreme values, while also determining variables with near-zero variance that provide limited analytical value. Below is a summary of the findings:

Distribution

  • Approximately Normal: Variables such as “Carb.Volume,” “Fill.Ounces,” and “PSC” exhibit a roughly symmetric distribution, indicating a normal-like pattern.
  • Skewed: Variables like “Carb.Flow,” “Hyd.Pressure3,” “Density,” and “Balling” display significant asymmetry, indicating skewness in the data.
  • Multimodal: Variables such as “MFR” and “Filler.Speed” exhibit multiple peaks, suggesting the presence of distinct subgroups or clusters in the data.

Skewness

  • Right-Skewed: Variables like “Carb.Flow,” “Hyd.Pressure3,” “MFR,” and “Usage.cont” have a longer tail on the right side, indicating higher values are less frequent.
  • Left-Skewed: Variables such as “PSC.CO2,” “Carb.Rel,” and “PH” have a longer tail on the left side, with lower values being less frequent.

Outliers

  • Significant Outliers: Variables including “MFR,” “Filler.Speed,” “Oxygen.Filler,” and “Bowl.Setpoint” have extreme values that deviate from the bulk of the data. These may require further investigation or transformation.

Excluded Variables (Near-Zero Variance)

  • The near-zero variance check below flags “Hyd.Pressure1”, which shows almost no variability across its observations. It was excluded from the visualizations because it provides limited information for analysis.
# Install and load required libraries
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}

library(caret)
## Loading required package: lattice
library(dplyr)
library(tidyr)

# Reshape the imputed train_set to long format for numeric columns
numeric_cols <- sapply(train_set_imputed, is.numeric)
train_set_long <- pivot_longer(train_set_imputed,
                               cols = all_of(names(train_set_imputed)[numeric_cols]),
                               names_to = "variable",
                               values_to = "value")

# Identify variables with near-zero variance using caret::nearZeroVar
nzv_indices <- nearZeroVar(train_set_imputed, saveMetrics = TRUE)

# Extract variables with near-zero variance
nzv_vars <- rownames(nzv_indices[nzv_indices$nzv == TRUE, ])

# Output the list of variables with near-zero variance
cat("Variables with near-zero variance (not plotted):\n")
## Variables with near-zero variance (not plotted):
print(nzv_vars)
## [1] "Hyd.Pressure1"
# Filter out variables with near-zero variance from the long-format data
train_set_long_filtered <- train_set_long %>%
  filter(!variable %in% nzv_vars)

# Clip extreme values to the 1st and 99th percentiles for better visualization
train_set_long_filtered <- train_set_long_filtered %>%
  group_by(variable) %>%
  mutate(value = pmin(pmax(value, quantile(value, 0.01, na.rm = TRUE)), 
                      quantile(value, 0.99, na.rm = TRUE))) %>%
  ungroup()

# Prepare the list of numeric variables for separate plotting
numeric_variables <- unique(train_set_long_filtered$variable)

# Set up the plotting area for histograms and boxplots
par(mfrow = c(1, 2))  # 2 columns for histogram and boxplot side-by-side

# Loop through each numeric variable to plot
for (var in numeric_variables) {
  # Filter data for the current variable
  var_data <- train_set_long_filtered %>% filter(variable == var) %>% pull(value)
  
  # Plot histogram
  hist(var_data, main = paste("Histogram of", var), 
       xlab = var, breaks = 30, col = "lightblue", border = "black")
  
  # Plot boxplot
  boxplot(var_data, main = paste("Boxplot of", var), 
          horizontal = TRUE, col = "lightgreen")
}

# Reset plotting layout to default
par(mfrow = c(1, 1))

Preprocessing for Machine Learning: Handling Missing Values and Variance Stabilization

Data Preparation and Yeo-Johnson Transformation

To prepare the datasets for statistical and machine learning models, the following steps were applied:

  • Missing Values:
    • Missing values in key variables, such as Brand.Code, were identified and appropriately imputed to ensure data completeness.
  • Yeo-Johnson Transformation:
    • The Yeo-Johnson transformation was applied to numeric variables (excluding PH and categorical variables) to handle both positive and negative values. This transformation stabilizes variance and improves normality, enhancing model performance and robustness.

These preprocessing steps ensure the datasets are ready for accurate and reliable analysis in statistical and machine learning models.
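
For reference, the Yeo-Johnson transformation of a value \(y\) with parameter \(\lambda\) (estimated separately for each variable by caret's preProcess) is

\[
\psi(y;\lambda) =
\begin{cases}
\dfrac{(y+1)^{\lambda} - 1}{\lambda}, & y \ge 0,\ \lambda \ne 0 \\
\log(y+1), & y \ge 0,\ \lambda = 0 \\
-\dfrac{(-y+1)^{2-\lambda} - 1}{2-\lambda}, & y < 0,\ \lambda \ne 2 \\
-\log(-y+1), & y < 0,\ \lambda = 2,
\end{cases}
\]

which, unlike Box-Cox, is defined for zero and negative values.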

# Load necessary libraries
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

library(caret)   # For preProcess and Yeo-Johnson transformation
library(dplyr)   # For data manipulation

# Specify columns to exclude (e.g., PH, Brand.Code)
exclude_cols <- c("Brand.Code", "PH")

# Create copies of the imputed datasets for transformation
train_set_transformed <- train_set_imputed
evaluation_set_transformed <- evaluation_set_imputed

# Identify numeric columns to process in the train and evaluation sets
numeric_cols_train <- setdiff(
  names(train_set_imputed)[sapply(train_set_imputed, is.numeric)], 
  exclude_cols
)
numeric_cols_test <- setdiff(
  names(evaluation_set_imputed)[sapply(evaluation_set_imputed, is.numeric)], 
  exclude_cols
)

# Process each numeric column in the train set
for (col in numeric_cols_train) {
  tryCatch({
    # Ensure the column is numeric
    if (!is.numeric(train_set_imputed[[col]])) {
      stop(paste("Column", col, "is not numeric in train_set_imputed"))
    }
    
    # Perform Yeo-Johnson transformation
    yeo_johnson_result <- caret::preProcess(
      as.data.frame(train_set_imputed[[col]]), 
      method = "YeoJohnson"
    )
    
    # Apply transformation to the train set
    train_set_transformed[[col]] <- predict(
      yeo_johnson_result, 
      as.data.frame(train_set_imputed[[col]])
    )[, 1]
    
  }, error = function(e) {
    cat(paste("Error processing column", col, ":", e$message, "\n"))
  })
}

# Process each numeric column in the evaluation set
for (col in numeric_cols_test) {
  tryCatch({
    # Ensure the column is numeric
    if (!is.numeric(evaluation_set_imputed[[col]])) {
      stop(paste("Column", col, "is not numeric in evaluation_set_imputed"))
    }
    
    # Perform Yeo-Johnson transformation
    yeo_johnson_result <- caret::preProcess(
      as.data.frame(evaluation_set_imputed[[col]]), 
      method = "YeoJohnson"
    )
    
    # Apply transformation to the evaluation set
    evaluation_set_transformed[[col]] <- predict(
      yeo_johnson_result, 
      as.data.frame(evaluation_set_imputed[[col]])
    )[, 1]
    
  }, error = function(e) {
    cat(paste("Error processing column", col, ":", e$message, "\n"))
  })
}

# Output the structure of transformed train and evaluation sets
cat("\n\nStructure of Transformed Train Set (Yeo-Johnson Applied):\n\n")
## 
## 
## Structure of Transformed Train Set (Yeo-Johnson Applied):
str(train_set_transformed)
## 'data.frame':    2571 obs. of  33 variables:
##  $ Brand.Code       : Factor w/ 4 levels "A","B","C","D": 2 1 2 1 1 1 1 2 2 2 ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  729 732 735 732 752 ...
##  $ PC.Volume        : num  0.208 0.193 0.208 0.227 0.1 ...
##  $ Carb.Pressure    : num  2.14 2.14 2.15 2.12 2.14 ...
##  $ Carb.Temp        : num  0.5 0.5 0.5 0.5 0.5 ...
##  $ PSC              : num  0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.169 0.152 0.199 0.222 0.121 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -114 -114 -114 -114 -114 ...
##  $ Carb.Pressure1   : num  21.2 21.5 21.4 20.9 21.2 ...
##  $ Fill.Pressure    : num  1.41 1.41 1.41 1.41 1.41 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  2.64 2.61 2.54 2.57 2.57 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 1010 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : num  24689 27033 24492 26121 26032 ...
##  $ Density          : num  0.453 0.463 0.584 0.579 0.579 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  0.628 0.648 0.848 0.84 0.84 ...
##  $ Pressure.Vacuum  : num  -44.9 -44.9 -39.6 -56.9 -56.9 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  0.678 0.695 0.925 0.903 0.903 ...
#cat("\n\nStructure of Transformed Evaluation Set (Yeo-Johnson Applied):\n\n")
#str(evaluation_set_transformed)

Skewness Comparison: Impact of Yeo-Johnson Transformation

  • Overview:
    • This visualization illustrates the skewness of numeric variables before and after the Yeo-Johnson transformation, highlighting its ability to handle both positive and negative values effectively while reducing skewness.
  • Key Observations:
    • Significant Skewness Reduction: Variables such as Hyd.Pressure2, MFR, and Carb.Volume show substantial decreases in skewness, transitioning towards more symmetric distributions favorable for modeling.
    • Minimal Changes: Variables like Temperature and PH, with low initial skewness, show little to no effect, indicating that the transformation preserves already symmetric distributions.
  • Conclusion:
    • The Yeo-Johnson transformation effectively reduces skewness, particularly in highly skewed variables, improving data symmetry. This makes the dataset more suitable for statistical and machine learning models that benefit from normally distributed inputs.
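
For reference, the skewness compared below is the sample skewness computed by the e1071 package, based on the third standardized moment (the exact small-sample correction depends on the package's type argument):

\[
g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}
\]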
# Load necessary library
if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071")
}
library(e1071)  # For calculating skewness

# Select numeric columns in the original and transformed datasets
numeric_cols <- names(train_set_imputed)[sapply(train_set_imputed, is.numeric)]

# Calculate skewness for train_set_imputed (Before transformation)
skewness_imputed <- sapply(train_set_imputed[numeric_cols], skewness, na.rm = TRUE)

# Calculate skewness for train_set_transformed (After Yeo-Johnson transformation)
skewness_transformed <- sapply(train_set_transformed[numeric_cols], skewness, na.rm = TRUE)

# Combine the skewness results into a data frame for comparison
skewness_comparison <- data.frame(
  Variable = numeric_cols,
  Skewness_Before = skewness_imputed,
  Skewness_After = skewness_transformed
)

# Reshape the data for plotting
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
skewness_long <- melt(skewness_comparison, id.vars = "Variable", 
                      variable.name = "Skewness_Type", value.name = "Skewness")

# Plot the skewness comparison
library(ggplot2)
ggplot(skewness_long, aes(x = Variable, y = Skewness, fill = Skewness_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(
    title = "Skewness Comparison: Before and After Yeo-Johnson Transformation",
    x = "Variable",
    y = "Skewness"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8), legend.position = "top")

Remove Near-Zero Variance and Highly Correlated Variables for Model Suitability

In the dataset, several near-zero variance and highly correlated variables were identified and removed, including Balling, Alch.Rel, Balling.Lvl, Density, and others. Removing these variables is crucial for models that are sensitive to multicollinearity or feature redundancy, such as:

  • Linear Regression: To prevent inflated variance estimates and ensure stable coefficients.
  • Logistic Regression: To enhance interpretability and maintain prediction accuracy.
  • PCA (Principal Component Analysis): To avoid redundant variables dominating principal components.
  • Regularized Models (e.g., Lasso, Ridge): While these models address multicollinearity, removing redundant variables improves computational efficiency.

For tree-based models or non-linear algorithms (e.g., Random Forest, XGBoost), this step is generally unnecessary, as these models are robust to multicollinearity and handle feature interactions inherently.

# Load necessary libraries
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("corrplot", quietly = TRUE)) {
  install.packages("corrplot")
}

library(caret)
library(corrplot)
## corrplot 0.95 loaded
library(dplyr)

# Save a copy of the full datasets
full_train_data <- train_set_transformed
full_evaluation_data <- evaluation_set_transformed

# Separate Brand.Code and PH for exclusion during analysis
brand_and_ph <- train_set_transformed %>% select(Brand.Code, PH)
train_set_no_brand_ph <- train_set_transformed %>% select(-Brand.Code, -PH)

# Step 1: Identify numeric variables in training data (excluding Brand.Code and PH)
numeric_train_data <- train_set_no_brand_ph[, sapply(train_set_no_brand_ph, is.numeric)]

# Step 2: Identify and remove near-zero variance variables
nzv <- nearZeroVar(numeric_train_data, saveMetrics = TRUE)

# Names of variables with near-zero variance
nzv_vars <- rownames(nzv[nzv$nzv, ])
cat("Near-zero variance variables removed:\n")
## Near-zero variance variables removed:
print(nzv_vars)
## [1] "Hyd.Pressure1"
# Remove near-zero variance variables
filtered_train_nzv <- numeric_train_data[, !(colnames(numeric_train_data) %in% nzv_vars)]

# Step 3: Compute the correlation matrix (before removing highly correlated variables)
correlations_before <- cor(filtered_train_nzv, use = "complete.obs")

# Identify highly correlated variables (absolute correlation > 0.75)
highCorr <- findCorrelation(correlations_before, cutoff = 0.75)

# Get the names of highly correlated variables
high_corr_vars <- colnames(filtered_train_nzv)[highCorr]

# Print the number and names of highly correlated variables
cat("\nNumber of highly correlated variables:", length(high_corr_vars), "\n")
## 
## Number of highly correlated variables: 9
cat("Highly correlated variables removed:\n")
## Highly correlated variables removed:
print(high_corr_vars)
## [1] "Balling"       "Density"       "Alch.Rel"      "Hyd.Pressure3"
## [5] "Balling.Lvl"   "Carb.Volume"   "Filler.Level"  "Filler.Speed" 
## [9] "Carb.Temp"
# Remove highly correlated variables from the training data
filtered_data_final <- filtered_train_nzv[, -highCorr]

# Step 4: Add back Brand.Code and PH
filtered_data_final <- cbind(brand_and_ph, filtered_data_final)

# Compute the correlation matrix after removing highly correlated variables
correlations_after <- cor(filtered_data_final %>% select(-Brand.Code, -PH), use = "complete.obs")

# Step 5: Plot the correlation matrices (before and after)
par(mfrow = c(1, 2))  # Set up side-by-side plots

# Plot before removing highly correlated variables
corrplot(correlations_before, order = "hclust", tl.cex = 0.8, addrect = 2)
mtext("Before Removing\nHighly Correlated Variables", side = 3, line = 1, adj = 0.5, cex = 1.2)

# Plot after removing highly correlated variables
corrplot(correlations_after, order = "hclust", tl.cex = 0.8, addrect = 2)
mtext("After Removing\nHighly Correlated Variables", side = 3, line = 1, adj = 0.5, cex = 1.2)

# Reset plotting area
par(mfrow = c(1, 1))

# Step 6: Process evaluation set similarly
# Separate Brand.Code and PH
evaluation_brand_and_ph <- evaluation_set_transformed %>% select(Brand.Code, PH)
evaluation_set_no_brand_ph <- evaluation_set_transformed %>% select(-Brand.Code, -PH)

# Remove columns not in filtered_data_final
filtered_evaluation_final <- evaluation_set_no_brand_ph %>%
  select(all_of(colnames(filtered_data_final %>% select(-Brand.Code, -PH))))

# Add back Brand.Code and PH
filtered_evaluation_final <- cbind(evaluation_brand_and_ph, filtered_evaluation_final)

# Step 7: Save and inspect final datasets
# cat("\n\nStructure of Final Training Dataset:\n")
# str(filtered_data_final)

# cat("\n\nStructure of Final Evaluation Dataset:\n")
# str(filtered_evaluation_final)

Prepare Datasets for Models

Below are the datasets prepared for gradient-based models, tree-based models, and statistical models, reflecting the preceding processing steps and the appropriate handling of categorical variables such as Brand_Code:

1. Gradient-Based Models (e.g., Neural Networks, SVM, KNN)

For gradient-based models:

  • Min-max scaling is applied to all numeric predictors; the target variable PH is left on its original scale.
  • One-hot encoding is applied to Brand_Code without dropping any dummy variables.
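
For each numeric predictor, min-max scaling maps values onto [0, 1] using the training-set minimum and maximum:

\[
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]

Evaluation-set values that fall outside the training-set range therefore map outside [0, 1].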

# Install required packages if not already installed
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}

# Load libraries
library(caret)
library(fastDummies)

# Set seed for reproducibility
set.seed(123)

# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")

# Step 2: Scaling the Training Dataset
gradient_based_data <- filtered_data_final  

# Save scaling parameters during training
min_max_scaler <- preProcess(gradient_based_data[, numeric_cols], method = "range")
gradient_based_data[, numeric_cols] <- predict(min_max_scaler, gradient_based_data[, numeric_cols])

# Save min-max scaling parameters for reverse scaling later
gbm_scaling_params <- data.frame(
  feature = numeric_cols,
  min = apply(filtered_data_final[, numeric_cols], 2, min),
  max = apply(filtered_data_final[, numeric_cols], 2, max)
)

# Step 3: Scaling the Evaluation Dataset (Excluding "PH")
gradient_models_evaluation_data <- filtered_evaluation_final  # Replace with actual evaluation dataset

# Convert PH to numeric
gradient_models_evaluation_data$PH <- as.numeric(as.character(gradient_models_evaluation_data$PH))

# Apply the same scaling to the numeric columns (excluding "PH")
numeric_evaluation_cols <- setdiff(names(gradient_models_evaluation_data)[sapply(gradient_models_evaluation_data, is.numeric)], "PH")
gradient_models_evaluation_data[, numeric_evaluation_cols] <- predict(min_max_scaler, gradient_models_evaluation_data[, numeric_evaluation_cols])

# Step 4: Reordering Columns to Match Training Data
gradient_models_evaluation_data <- gradient_models_evaluation_data[, colnames(gradient_based_data), drop = FALSE]

# Step 5: One-Hot Encoding for Categorical Variables
gradient_based_data <- dummy_cols(
  gradient_based_data,
  select_columns = "Brand.Code",
  remove_first_dummy = FALSE,  # Keep all dummy variables
  remove_selected_columns = TRUE
)
gradient_models_evaluation_data <- dummy_cols(
  gradient_models_evaluation_data,
  select_columns = "Brand.Code",
  remove_first_dummy = FALSE,
  remove_selected_columns = TRUE
)

# Ensure PH remains numeric in evaluation dataset
gradient_models_evaluation_data$PH <- as.numeric(gradient_models_evaluation_data$PH)

# Step 6: Preparing Evaluation Data for Prediction 
gradient_models_evaluation_data_final <- gradient_models_evaluation_data[, !colnames(gradient_models_evaluation_data) %in% "PH"]

# Step 7: Display the Structure of the Final Datasets
cat("\n\nStructure of gradient_based_data:\n")
## 
## 
## Structure of gradient_based_data:
str(gradient_based_data)
## 'data.frame':    2571 obs. of  26 variables:
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Fill.Ounces      : num  0.481 0.539 0.617 0.539 0.99 ...
##  $ PC.Volume        : num  0.54 0.477 0.54 0.613 0.107 ...
##  $ Carb.Pressure    : num  0.556 0.564 0.667 0.314 0.511 ...
##  $ PSC              : num  0.3806 0.4552 0.3284 0.2612 0.0896 ...
##  $ PSC.Fill         : num  0.646 0.579 0.757 0.846 0.461 ...
##  $ PSC.CO2          : num  0.167 0.167 0.667 0.167 0.5 ...
##  $ Mnf.Flow         : num  0.000757 0.000757 0.000757 0.000757 0.000757 ...
##  $ Carb.Pressure1   : num  0.398 0.479 0.439 0.291 0.386 ...
##  $ Fill.Pressure    : num  0.554 0.554 0.554 0.569 0.546 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  0.835 0.736 0.486 0.6 0.6 ...
##  $ Temperature      : num  0.19 0.317 0.27 0.159 0.159 ...
##  $ Usage.cont       : num  0.297 0.566 0.411 0.386 0.405 ...
##  $ Carb.Flow        : num  0.486 0.532 0.482 0.514 0.513 ...
##  $ MFR              : num  0.828 0.831 0.84 0.835 0.826 ...
##  $ Pressure.Vacuum  : num  0.92 0.92 0.962 0.827 0.827 ...
##  $ Oxygen.Filler    : num  0.0493 0.0594 0.0543 0.0694 0.0694 ...
##  $ Bowl.Setpoint    : num  0.714 0.714 0.714 0.714 0.714 ...
##  $ Pressure.Setpoint: num  0.3 0.35 0.325 0.25 0.25 ...
##  $ Air.Pressurer    : num  0.243 0.297 0.162 0.73 0.73 ...
##  $ Carb.Rel         : num  0.327 0.309 0.8 0.418 0.436 ...
##  $ Brand.Code_A     : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Brand.Code_B     : int  1 0 1 0 0 0 0 1 1 1 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of gradient_models_evaluation_data:\n")
## 
## 
## Structure of gradient_models_evaluation_data:
str(gradient_models_evaluation_data_final)
## 'data.frame':    267 obs. of  25 variables:
##  $ Fill.Ounces      : num  -14.7 -14.7 -14.7 -14.7 -14.7 ...
##  $ PC.Volume        : num  0.819 0.636 0.96 0.466 0.358 ...
##  $ Carb.Pressure    : num  -23.8 -23.8 -23.8 -23.8 -23.8 ...
##  $ PSC              : num  0.87313 0.14925 0.24627 0.00746 0.14179 ...
##  $ PSC.Fill         : num  0.817 0.575 0.317 0.539 0.699 ...
##  $ PSC.CO2          : num  0.1667 0.3333 0.0833 0.0833 0.25 ...
##  $ Mnf.Flow         : num  -0.00467 -0.00467 -0.00467 -0.00467 -0.00467 ...
##  $ Carb.Pressure1   : num  -2.95 -2.93 -2.92 -2.88 -2.97 ...
##  $ Fill.Pressure    : num  185 185 184 170 198 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  4.48 4.82 4.53 5.17 4.43 ...
##  $ Temperature      : num  0.19 0.159 0.159 0.857 0.222 ...
##  $ Usage.cont       : num  0.693 0.399 0.876 0.437 0.669 ...
##  $ Carb.Flow        : num  0.090152 0.089034 0.093644 -0.000437 0.098863 ...
##  $ MFR              : num  0.832 0.841 0.84 0.336 0.861 ...
##  $ Pressure.Vacuum  : num  0.895 0.72 0.783 0.841 0.841 ...
##  $ Oxygen.Filler    : num  0.0493 0.0694 0.1097 0.2103 0.2002 ...
##  $ Bowl.Setpoint    : num  0.857 0.714 0.714 0.714 0.714 ...
##  $ Pressure.Setpoint: num  -5.44 -5.44 -5.44 -5.44 -5.44 ...
##  $ Air.Pressurer    : num  0.243 0.865 0.784 0.757 0.676 ...
##  $ Carb.Rel         : num  0.345 0.564 0.345 0.491 0.382 ...
##  $ Brand.Code_A     : int  0 1 0 0 0 0 1 0 1 0 ...
##  $ Brand.Code_B     : int  0 0 1 1 1 1 0 1 0 0 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  1 0 0 0 0 0 0 0 0 1 ...

2. Tree-Based Models (e.g., Random Forest, XGBoost, MARS)

For tree-based models:

  • The Yeo-Johnson-transformed dataset is used as is, without additional scaling of the numeric variables or transformation of the target variable PH.
  • Label encoding is applied to Brand_Code instead of one-hot encoding.

# Label Encoding for Tree-Based Models
tree_based_data <- train_set_transformed
tree_based_evaluation_data <- evaluation_set_transformed

# Convert Brand_Code to numeric labels
tree_based_data$Brand_Code <- as.numeric(factor(tree_based_data$Brand.Code))
tree_based_evaluation_data$Brand_Code <- as.numeric(factor(tree_based_evaluation_data$Brand.Code))


# Check structure of the final dataset
str(tree_based_data)
## 'data.frame':    2571 obs. of  34 variables:
##  $ Brand.Code       : Factor w/ 4 levels "A","B","C","D": 2 1 2 1 1 1 1 2 2 2 ...
##  $ Carb.Volume      : num  5.34 5.43 5.29 5.44 5.49 ...
##  $ Fill.Ounces      : num  729 732 735 732 752 ...
##  $ PC.Volume        : num  0.208 0.193 0.208 0.227 0.1 ...
##  $ Carb.Pressure    : num  2.14 2.14 2.15 2.12 2.14 ...
##  $ Carb.Temp        : num  0.5 0.5 0.5 0.5 0.5 ...
##  $ PSC              : num  0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.169 0.152 0.199 0.222 0.121 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -114 -114 -114 -114 -114 ...
##  $ Carb.Pressure1   : num  21.2 21.5 21.4 20.9 21.2 ...
##  $ Fill.Pressure    : num  1.41 1.41 1.41 1.41 1.41 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  2.64 2.61 2.54 2.57 2.57 ...
##  $ Filler.Level     : num  121 119 120 118 119 ...
##  $ Filler.Speed     : int  4002 3986 4020 4012 4010 4014 1010 1004 4014 4028 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : num  24689 27033 24492 26121 26032 ...
##  $ Density          : num  0.453 0.463 0.584 0.579 0.579 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Balling          : num  0.628 0.648 0.848 0.84 0.84 ...
##  $ Pressure.Vacuum  : num  -44.9 -44.9 -39.6 -56.9 -56.9 ...
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Alch.Rel         : num  6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Balling.Lvl      : num  0.678 0.695 0.925 0.903 0.903 ...
##  $ Brand_Code       : num  2 1 2 1 1 1 1 2 2 2 ...

3. Statistical Models (e.g., Linear Regression, Logistic Regression, PCA)

For statistical models:

  • Standardization (mean centering and scaling to unit variance) is applied to all numeric predictors; the target variable PH is left on its original scale.
  • One-hot encoding is applied to Brand_Code, with one dummy variable dropped to avoid multicollinearity.
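
Standardization replaces each numeric predictor with its z-score, computed from the training-set mean and standard deviation:

\[
z = \frac{x - \bar{x}}{s}
\]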

# Install required packages if not already installed
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}

# Load libraries
library(caret)
library(fastDummies)

# Set seed for reproducibility
set.seed(123)

# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")

# Step 2: Standardizing the Training Dataset
statistical_models_data <- filtered_data_final  # Replace with actual training dataset

# Save standardization parameters during training
standard_scaler <- preProcess(statistical_models_data[, numeric_cols], method = c("center", "scale"))
statistical_models_data[, numeric_cols] <- predict(standard_scaler, statistical_models_data[, numeric_cols])

# Save standardization parameters for potential reverse standardization later
statistical_scaling_params <- data.frame(
  feature = numeric_cols,
  mean = apply(filtered_data_final[, numeric_cols], 2, mean),
  sd = apply(filtered_data_final[, numeric_cols], 2, sd)
)

# Step 3: Standardizing the Evaluation Dataset
statistical_models_evaluation_data <- filtered_evaluation_final  # Replace with actual evaluation dataset

# Convert PH to numeric
statistical_models_evaluation_data$PH <- as.numeric(as.character(statistical_models_evaluation_data$PH))

# Apply standardization to numeric columns in the evaluation dataset (excluding "PH")
numeric_evaluation_cols <- setdiff(names(statistical_models_evaluation_data)[sapply(statistical_models_evaluation_data, is.numeric)], "PH")
statistical_models_evaluation_data[, numeric_evaluation_cols] <- predict(standard_scaler, statistical_models_evaluation_data[, numeric_evaluation_cols])

# Step 4: One-Hot Encoding for Categorical Variables (Drop one dummy)
statistical_models_data <- dummy_cols(
  statistical_models_data,
  select_columns = "Brand.Code",
  remove_first_dummy = TRUE,  # Drop one dummy variable
  remove_selected_columns = TRUE
)
statistical_models_evaluation_data <- dummy_cols(
  statistical_models_evaluation_data,
  select_columns = "Brand.Code",
  remove_first_dummy = TRUE,  # Drop one dummy variable
  remove_selected_columns = TRUE
)

# Ensure PH remains numeric in the evaluation dataset
statistical_models_evaluation_data$PH <- as.numeric(statistical_models_evaluation_data$PH)

# Step 5: Reordering Columns
# Ensure the evaluation data columns match the training data
statistical_models_evaluation_data <- statistical_models_evaluation_data[, colnames(statistical_models_data), drop = FALSE]
if (identical(colnames(statistical_models_data), colnames(statistical_models_evaluation_data))) {
  cat("The column ordering is identical between the training and evaluation datasets.\n")
} else {
  cat("The column ordering is NOT identical. Differences:\n")
  setdiff(colnames(statistical_models_data), colnames(statistical_models_evaluation_data))
}
## The column ordering is identical between the training and evaluation datasets.
# Step 6: Checking the Structure of Final Datasets
cat("\n\nStructure of statistical_models_data:\n")
## 
## 
## Structure of statistical_models_data:
str(statistical_models_data)
## 'data.frame':    2571 obs. of  25 variables:
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Fill.Ounces      : num  -0.0937 0.3642 0.9761 0.3642 3.9061 ...
##  $ PC.Volume        : num  -0.192 -0.615 -0.192 0.299 -3.092 ...
##  $ Carb.Pressure    : num  0.0315 0.0874 0.7418 -1.5053 -0.2515 ...
##  $ PSC              : num  0.395 0.801 0.111 -0.254 -1.186 ...
##  $ PSC.Fill         : num  0.711 0.397 1.237 1.654 -0.157 ...
##  $ PSC.CO2          : num  -0.382 -0.382 2.389 -0.382 1.466 ...
##  $ Mnf.Flow         : num  -1.05 -1.05 -1.05 -1.05 -1.05 ...
##  $ Carb.Pressure1   : num  -0.793 -0.197 -0.494 -1.569 -0.879 ...
##  $ Fill.Pressure    : num  -0.573 -0.573 -0.573 -0.438 -0.64 ...
##  $ Hyd.Pressure2    : num  -1.33 -1.33 -1.33 -1.33 -1.33 ...
##  $ Hyd.Pressure4    : num  1.531 0.772 -1.135 -0.264 -0.264 ...
##  $ Temperature      : num  0.0192 1.1666 0.7363 -0.2677 -0.2677 ...
##  $ Usage.cont       : num  -1.617 -0.367 -1.086 -1.2 -1.113 ...
##  $ Carb.Flow        : num  0.401 0.633 0.382 0.543 0.534 ...
##  $ MFR              : num  0.395 0.408 0.468 0.436 0.38 ...
##  $ Pressure.Vacuum  : num  1.89 1.89 2.11 1.39 1.39 ...
##  $ Oxygen.Filler    : num  -0.534 -0.449 -0.491 -0.364 -0.364 ...
##  $ Bowl.Setpoint    : num  0.698 0.698 0.698 0.698 0.698 ...
##  $ Pressure.Setpoint: num  -0.594 -0.398 -0.496 -0.791 -0.791 ...
##  $ Air.Pressurer    : num  -0.193 0.137 -0.688 2.777 2.777 ...
##  $ Carb.Rel         : num  -0.9058 -1.061 3.1308 -0.1295 0.0257 ...
##  $ Brand.Code_B     : int  1 0 1 0 0 0 0 1 1 1 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of statistical_models_evaluation_data:\n")
## 
## 
## Structure of statistical_models_evaluation_data:
str(statistical_models_evaluation_data)
## 'data.frame':    267 obs. of  25 variables:
##  $ PH               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Fill.Ounces      : num  -120 -120 -120 -120 -120 ...
##  $ PC.Volume        : num  1.672 0.45 2.62 -0.689 -1.412 ...
##  $ Carb.Pressure    : num  -155 -155 -155 -155 -155 ...
##  $ PSC              : num  3.071 -0.862 -0.335 -1.632 -0.902 ...
##  $ PSC.Fill         : num  1.52 0.379 -0.837 0.209 0.963 ...
##  $ PSC.CO2          : num  -0.3817 0.542 -0.8436 -0.8436 0.0801 ...
##  $ Mnf.Flow         : num  -1.06 -1.06 -1.06 -1.06 -1.06 ...
##  $ Carb.Pressure1   : num  -25.3 -25.1 -25 -24.7 -25.4 ...
##  $ Fill.Pressure    : num  1610 1614 1606 1478 1721 ...
##  $ Hyd.Pressure2    : num  -1.33 -1.33 -1.33 -1.33 -1.33 ...
##  $ Hyd.Pressure4    : num  29.4 32 29.7 34.7 29 ...
##  $ Temperature      : num  0.0192 -0.2677 -0.2677 6.0431 0.306 ...
##  $ Usage.cont       : num  0.224 -1.14 1.07 -0.965 0.11 ...
##  $ Carb.Flow        : num  -1.58 -1.58 -1.56 -2.03 -1.53 ...
##  $ MFR              : num  0.414 0.474 0.466 -2.585 0.591 ...
##  $ Pressure.Vacuum  : num  1.75 0.816 1.155 1.466 1.466 ...
##  $ Oxygen.Filler    : num  -0.5339 -0.3636 -0.0229 0.8289 0.7437 ...
##  $ Bowl.Setpoint    : num  1.35 0.698 0.698 0.698 0.698 ...
##  $ Pressure.Setpoint: num  -23.1 -23.1 -23.1 -23.1 -23.1 ...
##  $ Air.Pressurer    : num  -0.193 3.603 3.107 2.942 2.447 ...
##  $ Carb.Rel         : num  -0.751 1.112 -0.751 0.491 -0.44 ...
##  $ Brand.Code_B     : int  0 0 1 1 1 1 0 1 0 0 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  1 0 0 0 0 0 0 0 0 1 ...

4. Statistical Models Data without Standardization

For these statistical models, the numeric predictors are left on their original scale (no standardization); one-hot encoding is applied to Brand.Code, with one dummy variable dropped to avoid multicollinearity.

# Install required packages if not already installed
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}

# Load libraries
library(fastDummies)

# Set seed for reproducibility
set.seed(123)

# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")

# Step 2: Assign the Training Dataset
statistical_models_data_raw <- filtered_data_final  # Preprocessed training dataset

# Step 3: Assign the Evaluation Dataset
statistical_models_evaluation_data_raw <- filtered_evaluation_final  # Preprocessed evaluation dataset

# Convert PH to numeric in the training dataset
statistical_models_data_raw$PH <- as.numeric(as.character(statistical_models_data_raw$PH))

# Step 4: One-Hot Encoding for Categorical Variables (Drop one dummy)
statistical_models_data_raw <- dummy_cols(
  statistical_models_data_raw,
  select_columns = "Brand.Code",
  remove_first_dummy = TRUE,  # Drop one dummy variable
  remove_selected_columns = TRUE
)
statistical_models_evaluation_data_raw <- dummy_cols(
  statistical_models_evaluation_data_raw,
  select_columns = "Brand.Code",
  remove_first_dummy = TRUE,  # Drop one dummy variable
  remove_selected_columns = TRUE
)

# Step 5: Reordering Columns
# Ensure the evaluation data columns match the training data
statistical_models_evaluation_data_raw <- statistical_models_evaluation_data_raw[, colnames(statistical_models_data_raw), drop = FALSE]
if (identical(colnames(statistical_models_data_raw), colnames(statistical_models_evaluation_data_raw))) {
  cat("The column ordering is identical between the training and evaluation datasets.\n")
} else {
  cat("The column ordering is NOT identical. Differences:\n")
  setdiff(colnames(statistical_models_data_raw), colnames(statistical_models_evaluation_data_raw))
}
## The column ordering is identical between the training and evaluation datasets.
# Step 6: Checking the Structure of Final Datasets
cat("\n\nStructure of statistical_models_data:\n")
## 
## 
## Structure of statistical_models_data:
str(statistical_models_data_raw)
## 'data.frame':    2571 obs. of  25 variables:
##  $ PH               : num  8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
##  $ Fill.Ounces      : num  729 732 735 732 752 ...
##  $ PC.Volume        : num  0.208 0.193 0.208 0.227 0.1 ...
##  $ Carb.Pressure    : num  2.14 2.14 2.15 2.12 2.14 ...
##  $ PSC              : num  0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
##  $ PSC.Fill         : num  0.169 0.152 0.199 0.222 0.121 ...
##  $ PSC.CO2          : num  0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
##  $ Mnf.Flow         : num  -114 -114 -114 -114 -114 ...
##  $ Carb.Pressure1   : num  21.2 21.5 21.4 20.9 21.2 ...
##  $ Fill.Pressure    : num  1.41 1.41 1.41 1.41 1.41 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  2.64 2.61 2.54 2.57 2.57 ...
##  $ Temperature      : num  66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
##  $ Usage.cont       : num  16.2 19.9 17.8 17.4 17.7 ...
##  $ Carb.Flow        : num  24689 27033 24492 26121 26032 ...
##  $ MFR              : num  725 727 735 731 723 ...
##  $ Pressure.Vacuum  : num  -44.9 -44.9 -39.6 -56.9 -56.9 ...
##  $ Oxygen.Filler    : num  0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
##  $ Bowl.Setpoint    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  46.4 46.8 46.6 46 46 46 46 46 46 46 ...
##  $ Air.Pressurer    : num  143 143 142 146 146 ...
##  $ Carb.Rel         : num  5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
##  $ Brand.Code_B     : int  1 0 1 0 0 0 0 1 1 1 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of statistical_models_evaluation_data:\n")
## 
## 
## Structure of statistical_models_evaluation_data:
str(statistical_models_evaluation_data_raw)
## 'data.frame':    267 obs. of  25 variables:
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ Fill.Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC.Volume        : num  0.278 0.232 0.313 0.19 0.163 ...
##  $ Carb.Pressure    : num  0.471 0.471 0.471 0.471 0.471 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC.Fill         : num  0.2144 0.1509 0.0832 0.1414 0.1834 ...
##  $ PSC.CO2          : num  0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
##  $ Mnf.Flow         : num  -115 -115 -115 -115 -115 ...
##  $ Carb.Pressure1   : num  9.74 9.81 9.85 9.99 9.69 ...
##  $ Fill.Pressure    : num  9.73 9.76 9.71 9.05 10.31 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  3.68 3.77 3.69 3.88 3.66 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
##  $ Usage.cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb.Flow        : num  4624 4567.4 4801 32.9 5065.5 ...
##  $ MFR              : num  728 736 735 313 752 ...
##  $ Pressure.Vacuum  : num  -48.2 -70.6 -62.5 -55 -55 ...
##  $ Oxygen.Filler    : num  0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl.Setpoint    : int  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  0.452 0.452 0.452 0.452 0.452 ...
##  $ Air.Pressurer    : num  143 147 147 146 146 ...
##  $ Carb.Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Brand.Code_B     : int  0 0 1 1 1 1 0 1 0 0 ...
##  $ Brand.Code_C     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Brand.Code_D     : int  1 0 0 0 0 0 0 0 0 1 ...

Model Building

Ordinary Least Squares (OLS) Models

Multiple Linear Regression (MLR) models will be built.

Process of OLS Modeling

  1. Data Preparation: Split the dataset into training (80%) and testing (20%) sets to evaluate model performance.
  2. Baseline MLR Model: Fit a multiple linear regression (MLR) model using all available features.
  3. Feature Selection: Use forward selection, backward elimination, stepwise selection, and recursive feature elimination (RFE) to identify subsets of relevant predictors.
  4. Feature Engineering: Create new features based on domain knowledge or interactions to improve model performance.
  5. Advanced Models: Incorporate interaction terms or polynomial features to capture complex relationships between variables.
  6. Evaluation: Evaluate each model using \(R^2\) and RMSE on both the training and test sets to identify the best-performing model (see the formulas after this list).
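
For reference, the two metrics used throughout this section are computed as follows, where \(y_i\) are observed PH values, \(\hat{y}_i\) are model predictions, and \(\bar{y}\) is the mean of the observed values:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]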

OLS Model Performance

The Interaction Terms Model emerged as the best OLS model, with the lowest Test_RMSE (0.1300) and the highest Train_R2 (0.6188), while retaining reasonable generalization (Test_R2 of 0.4707). This indicates that incorporating interactions between variables meaningfully improves predictive performance.
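
To make the interaction specification concrete, the short sketch below (using a hypothetical three-column toy frame, not the project data) shows how the formula PH ~ .^2 expands into main effects plus all pairwise interaction terms:

# Illustrative sketch only: how the formula PH ~ .^2 expands the design matrix
toy <- data.frame(PH = c(8.4, 8.5, 8.3), A = c(1, 2, 3), B = c(4, 6, 5))
colnames(model.matrix(PH ~ .^2, data = toy))
## expected: "(Intercept)" "A" "B" "A:B"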

# Load required libraries
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("Metrics", quietly = TRUE)) {
  install.packages("Metrics")
}
library(caret)
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Split data into training (80%) and testing (20%) sets
ols_train_indices <- createDataPartition(statistical_models_data_raw$PH, p = 0.8, list = FALSE)
ols_train_data <- statistical_models_data_raw[ols_train_indices, ]
ols_test_data <- statistical_models_data_raw[-ols_train_indices, ]

### Full Model ###
full_model <- lm(PH ~ ., data = ols_train_data)

# Null model (intercept only)
null_model <- lm(PH ~ 1, data = ols_train_data)

# Function to evaluate models
evaluate_model <- function(model, ols_train_data, ols_test_data) {
  train_predictions <- predict(model, newdata = ols_train_data)
  test_predictions <- predict(model, newdata = ols_test_data)
  
  # Training Metrics
  train_r2 <- summary(model)$r.squared
  train_mse <- mean((ols_train_data$PH - train_predictions)^2)
  train_rmse <- sqrt(train_mse)
  
  # Testing Metrics
  test_r2 <- 1 - sum((ols_test_data$PH - test_predictions)^2) / sum((ols_test_data$PH - mean(ols_test_data$PH))^2)
  test_mse <- mean((ols_test_data$PH - test_predictions)^2)
  test_rmse <- sqrt(test_mse)
  
  return(list(
    Train_R2 = train_r2,
    Train_MSE = train_mse,
    Train_RMSE = train_rmse,
    Test_R2 = test_r2,
    Test_MSE = test_mse,
    Test_RMSE = test_rmse
  ))
}

### 1. Forward Selection ###
forward_model <- step(null_model, 
                      scope = list(lower = null_model, upper = full_model),
                      direction = "forward", 
                      trace = 0)
forward_metrics <- evaluate_model(forward_model, ols_train_data, ols_test_data)

### 2. Backward Elimination ###
backward_model <- step(full_model, 
                       direction = "backward", 
                       trace = 0)
backward_metrics <- evaluate_model(backward_model, ols_train_data, ols_test_data)

### 3. Stepwise Selection ###
stepwise_model <- step(null_model, 
                       scope = list(lower = null_model, upper = full_model),
                       direction = "both", 
                       trace = 0)
stepwise_metrics <- evaluate_model(stepwise_model, ols_train_data, ols_test_data)

### 4. Recursive Feature Elimination (RFE) ###
control <- rfeControl(functions = lmFuncs, method = "cv", number = 10)
rfe_model <- rfe(ols_train_data[, -which(names(ols_train_data) == "PH")], 
                 ols_train_data$PH, 
                 sizes = c(1:10), 
                 rfeControl = control)
# Fit a model using RFE-selected predictors
selected_vars <- predictors(rfe_model)
rfe_final_model <- lm(as.formula(paste("PH ~", paste(selected_vars, collapse = " + "))), data = ols_train_data)
rfe_metrics <- evaluate_model(rfe_final_model, ols_train_data, ols_test_data)

### 5. Interaction Terms ###
interaction_model <- lm(PH ~ .^2, data = ols_train_data) # Includes all pairwise interactions
interaction_metrics <- evaluate_model(interaction_model, ols_train_data, ols_test_data)

### 6. Polynomial Features ###
polynomial_formula <- as.formula(
  "PH ~ poly(Fill.Ounces, 2) + poly(PC.Volume, 2) + poly(Carb.Pressure, 2) + ."
)
polynomial_model <- lm(polynomial_formula, data = ols_train_data)
polynomial_metrics <- evaluate_model(polynomial_model, ols_train_data, ols_test_data)

### 7. Feature Engineering ###
ols_train_data_fe <- ols_train_data %>%
  mutate(
    Pressure_Ratio = Fill.Pressure / Carb.Pressure,
    Temperature_Deviation = abs(Temperature - mean(Temperature, na.rm = TRUE)),
    PSC_Carb_Interaction = PSC * Carb.Pressure
  )
ols_test_data_fe <- ols_test_data %>%
  mutate(
    Pressure_Ratio = Fill.Pressure / Carb.Pressure,
    Temperature_Deviation = abs(Temperature - mean(ols_train_data$Temperature, na.rm = TRUE)),
    PSC_Carb_Interaction = PSC * Carb.Pressure
  )
fe_model <- lm(PH ~ ., data = ols_train_data_fe)
fe_metrics <- evaluate_model(fe_model, ols_train_data_fe, ols_test_data_fe)

### Combine All Metrics into a Single DataFrame ###
metrics_df <- data.frame(
  Model = c(
    "MLR (All Features Included)", 
    "Forward Selection", 
    "Backward Elimination", 
    "Stepwise Selection", 
    "Recursive Feature Elimination (RFE)", 
    "Interaction Terms", 
    "Polynomial Features", 
    "Feature Engineering"
  ),
  Train_R2 = c(
    summary(full_model)$r.squared, 
    forward_metrics$Train_R2, 
    backward_metrics$Train_R2, 
    stepwise_metrics$Train_R2, 
    rfe_metrics$Train_R2, 
    interaction_metrics$Train_R2, 
    polynomial_metrics$Train_R2, 
    fe_metrics$Train_R2
  ),
  Test_R2 = c(
    evaluate_model(full_model, ols_train_data, ols_test_data)$Test_R2, 
    forward_metrics$Test_R2, 
    backward_metrics$Test_R2, 
    stepwise_metrics$Test_R2, 
    rfe_metrics$Test_R2, 
    interaction_metrics$Test_R2, 
    polynomial_metrics$Test_R2, 
    fe_metrics$Test_R2
  ),
  Train_RMSE = c(
    evaluate_model(full_model, ols_train_data, ols_test_data)$Train_RMSE, 
    forward_metrics$Train_RMSE, 
    backward_metrics$Train_RMSE, 
    stepwise_metrics$Train_RMSE, 
    rfe_metrics$Train_RMSE, 
    interaction_metrics$Train_RMSE, 
    polynomial_metrics$Train_RMSE, 
    fe_metrics$Train_RMSE
  ),
  Test_RMSE = c(
    evaluate_model(full_model, ols_train_data, ols_test_data)$Test_RMSE, 
    forward_metrics$Test_RMSE, 
    backward_metrics$Test_RMSE, 
    stepwise_metrics$Test_RMSE, 
    rfe_metrics$Test_RMSE, 
    interaction_metrics$Test_RMSE, 
    polynomial_metrics$Test_RMSE, 
    fe_metrics$Test_RMSE
  )
)

# Sort metrics_df by Test_RMSE in ascending order
metrics_df <- metrics_df %>%
  arrange(Test_RMSE)

# Display the sorted DataFrame
metrics_df
##                                 Model  Train_R2   Test_R2 Train_RMSE Test_RMSE
## 1                   Interaction Terms 0.6187726 0.4706976  0.1054870 0.1300389
## 2                 Feature Engineering 0.4023797 0.3678177  0.1320747 0.1421158
## 3         MLR (All Features Included) 0.4016072 0.3677272  0.1321600 0.1421260
## 4 Recursive Feature Elimination (RFE) 0.4016072 0.3677272  0.1321600 0.1421260
## 5                Backward Elimination 0.4005068 0.3659733  0.1322815 0.1423230
## 6                   Forward Selection 0.4005068 0.3659733  0.1322815 0.1423230
## 7                  Stepwise Selection 0.4005068 0.3659733  0.1322815 0.1423230
## 8                 Polynomial Features 0.4054491 0.3635935  0.1317351 0.1425899

MLR Best Model Residual Analysis (Interaction Terms Model)

Residual Diagnostics Statement:

  1. Residuals vs Fitted Plot: The residuals exhibit a random scatter around the zero line, indicating linearity. However, some slight heteroscedasticity may be present as the spread of residuals appears to increase slightly for higher fitted values.

  2. Q-Q Plot: The residuals generally follow the theoretical quantiles, suggesting that the normality assumption is approximately satisfied, though some deviation is observed in the tails.

  3. Scale-Location Plot: The spread of standardized residuals is relatively consistent across fitted values, supporting the homoscedasticity assumption.

  4. Residuals vs Leverage Plot: A few high-leverage points are present, as indicated by Cook’s distance. These points may influence the model significantly and warrant further investigation.

Overall, the diagnostic plots indicate that the model assumptions are reasonably met, though some high-leverage points may require attention.
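
As a follow-up on the high-leverage observations, a minimal sketch (not part of the original workflow) for flagging potentially influential rows in the interaction model using the common 4/n Cook's distance cutoff:

# Sketch: flag observations whose Cook's distance exceeds 4/n
cooks_d <- cooks.distance(interaction_model)
influential <- which(cooks_d > 4 / length(cooks_d))
head(sort(cooks_d[influential], decreasing = TRUE))  # largest offenders first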

# Generate diagnostic plots for the training model
par(mfrow = c(2, 2))
plot(interaction_model)

par(mfrow = c(1, 1))  # Reset plotting layout

# summary(interaction_model)

Partial Least Squares (PLS) Models

PLS Modeling Process

  1. Data Preparation:
    • The data was split into training (80%) and testing (20%) sets.
    • Predictors (train_X, test_X) and the response variable (train_y, test_y) were prepared.
  2. Model Training and Evaluation:
    • PLS and PCR: Partial Least Squares (PLS) and Principal Component Regression (PCR) were trained using cross-validation to select the optimal number of components. Predictions were made for both training and testing datasets, and RMSE and \(R^2\) were calculated.
    • Ridge and Lasso Regression: Ridge (alpha=0) and Lasso (alpha=1) regression models were trained using cross-validation to select the optimal regularization parameter (\(\lambda\)). Predictions were made on the datasets, and metrics were computed.
    • Elastic Net: Elastic Net, which combines the Lasso and Ridge penalties, was tuned over a grid of \(\lambda\) and \(\alpha\) values using cross-validation; predictions and metrics were computed (the shared penalized objective is shown after this list).
  3. Result Compilation:
    • RMSE and \(R^2\) for both training and testing datasets were consolidated into a DataFrame for all models.
    • Models were ranked by Test RMSE in ascending order.
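
The Ridge, Lasso, and Elastic Net fits in step 2 all minimize the same penalized least-squares objective used by glmnet; \(\alpha = 0\) recovers Ridge, \(\alpha = 1\) recovers Lasso, and intermediate values give the Elastic Net:

\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^{\top}\beta\right)^2
+ \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]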

Partial Least Squares Model Performance

Based on the sorted DataFrame output, Lasso Regression is the best-performing model in this group, with the lowest Test RMSE (\(0.1421\)) and competitive \(R^2\) values. This suggests that it balances bias and variance effectively for this dataset.


# Install required packages
if (!requireNamespace("pls", quietly = TRUE)) {
  install.packages("pls")
}
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("glmnet", quietly = TRUE)) {
  install.packages("glmnet")
}

# Load required libraries
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
## 
##     corrplot
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
library(caret)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
# Prepare predictors (X) and response (y)
train_X <- as.matrix(ols_train_data[, -which(names(ols_train_data) == "PH")])
train_y <- ols_train_data$PH
test_X <- as.matrix(ols_test_data[, -which(names(ols_test_data) == "PH")])
test_y <- ols_test_data$PH

### 1. Partial Least Squares Regression (PLSR)
# Fit PLS model with cross-validation
pls_model <- plsr(PH ~ ., data = ols_train_data, validation = "CV")

# Determine the optimal number of components
optimal_ncomp <- selectNcomp(pls_model, method = "onesigma", plot = FALSE)

# Predict on the training and test sets using the optimal number of components
pls_train_predictions <- predict(pls_model, newdata = train_X, ncomp = optimal_ncomp)
pls_test_predictions <- predict(pls_model, newdata = test_X, ncomp = optimal_ncomp)

# Evaluate the PLS model
pls_train_rmse <- RMSE(pls_train_predictions, train_y)
pls_test_rmse <- RMSE(pls_test_predictions, test_y)
pls_train_r2 <- cor(pls_train_predictions, train_y)^2
pls_test_r2 <- cor(pls_test_predictions, test_y)^2

### 2. Principal Component Regression (PCR)
# Fit PCR model with cross-validation
pcr_model <- pcr(PH ~ ., data = ols_train_data, validation = "CV")

# Determine the optimal number of components
optimal_ncomp_pcr <- selectNcomp(pcr_model, method = "onesigma", plot = FALSE)

# Predict on the training and test sets using the optimal number of components
pcr_train_predictions <- predict(pcr_model, newdata = train_X, ncomp = optimal_ncomp_pcr)
pcr_test_predictions <- predict(pcr_model, newdata = test_X, ncomp = optimal_ncomp_pcr)

# Evaluate the PCR model
pcr_train_rmse <- RMSE(pcr_train_predictions, train_y)
pcr_test_rmse <- RMSE(pcr_test_predictions, test_y)
pcr_train_r2 <- cor(pcr_train_predictions, train_y)^2
pcr_test_r2 <- cor(pcr_test_predictions, test_y)^2

### 3. Ridge Regression
ridge_model <- glmnet(train_X, train_y, alpha = 0) # alpha = 0 for Ridge
cv_ridge <- cv.glmnet(train_X, train_y, alpha = 0) # Cross-validation for Ridge
ridge_lambda <- cv_ridge$lambda.min # Optimal lambda

# Predict on the training and test sets
ridge_train_predictions <- predict(cv_ridge, newx = train_X, s = ridge_lambda)
ridge_test_predictions <- predict(cv_ridge, newx = test_X, s = ridge_lambda)

# Evaluate Ridge model
ridge_train_rmse <- RMSE(ridge_train_predictions, train_y)
ridge_test_rmse <- RMSE(ridge_test_predictions, test_y)
ridge_train_r2 <- cor(ridge_train_predictions, train_y)^2
ridge_test_r2 <- cor(ridge_test_predictions, test_y)^2

### 4. Lasso Regression
lasso_model <- glmnet(train_X, train_y, alpha = 1) # alpha = 1 for Lasso
cv_lasso <- cv.glmnet(train_X, train_y, alpha = 1) # Cross-validation for Lasso
lasso_lambda <- cv_lasso$lambda.min # Optimal lambda

# Predict on the training and test sets
lasso_train_predictions <- predict(cv_lasso, newx = train_X, s = lasso_lambda)
lasso_test_predictions <- predict(cv_lasso, newx = test_X, s = lasso_lambda)

# Evaluate Lasso model
lasso_train_rmse <- RMSE(lasso_train_predictions, train_y)
lasso_test_rmse <- RMSE(lasso_test_predictions, test_y)
lasso_train_r2 <- cor(lasso_train_predictions, train_y)^2
lasso_test_r2 <- cor(lasso_test_predictions, test_y)^2

### 5. Elastic Net
elastic_net_model <- train(
  x = train_X, y = train_y,
  method = "glmnet",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10),
  preProc = c("center", "scale")
)

# Best Elastic Net model
best_enet <- elastic_net_model$finalModel
best_enet_lambda <- elastic_net_model$bestTune$lambda
best_enet_alpha <- elastic_net_model$bestTune$alpha

# Predict on the training and test sets
# (predict via the caret train object so that the center/scale preprocessing
#  applied during training is also applied to the new data)
enet_train_predictions <- predict(elastic_net_model, newdata = train_X)
enet_test_predictions <- predict(elastic_net_model, newdata = test_X)

# Evaluate Elastic Net model
enet_train_rmse <- RMSE(enet_train_predictions, train_y)
enet_test_rmse <- RMSE(enet_test_predictions, test_y)
enet_train_r2 <- cor(enet_train_predictions, train_y)^2
enet_test_r2 <- cor(enet_test_predictions, test_y)^2

### Combine Results into a Data Frame
results_df <- data.frame(
  Model = c("PLS", "PCR", "Ridge Regression", "Lasso Regression", "Elastic Net"),
  Train_R2 = c(pls_train_r2, pcr_train_r2, ridge_train_r2, lasso_train_r2, enet_train_r2),
  Test_R2 = c(pls_test_r2, pcr_test_r2, ridge_test_r2, lasso_test_r2, enet_test_r2),
  Train_RMSE = c(pls_train_rmse, pcr_train_rmse, ridge_train_rmse, lasso_train_rmse, enet_train_rmse),
  Test_RMSE = c(pls_test_rmse, pcr_test_rmse, ridge_test_rmse, lasso_test_rmse, enet_test_rmse)
)

# Sort metrics_df by Test_RMSE in ascending order
results_df <- results_df %>%
  arrange(Test_RMSE)

# Display the sorted DataFrame
results_df %>%
  kable(
    caption = "Summary of Partial Least Squares Models Performance"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE  # Set to FALSE for custom widths
  ) %>%
  column_spec(1, width = "5cm") %>%  # Adjust the width of the first column
  column_spec(2:5, width = "3cm")  # Adjust the width of columns 2 to 5
Summary of Partial Least Squares Models Performance

Model              Train_R2    Test_R2     Train_RMSE    Test_RMSE
Lasso Regression   0.4015648   0.3683422   0.1321666     0.1421016
Ridge Regression   0.4003721   0.3695003   0.1323733     0.1421150
PCR                0.3889516   0.3659666   0.1335502     0.1423654
PLS                0.3864426   0.3606606   0.1338241     0.1429587
Elastic Net        0.0442056   0.0235254   194.6793992   194.0983447

Residual Analysis, Variable Importance and Cross-Validation of the Best PLS (Lasso) Model

The diagnostic plots and the cross-validation summary for the Lasso model reveal the following:

  1. Residual Analysis:
    • The residuals vs. fitted plot indicates no obvious patterns, suggesting that the model does not suffer from severe non-linearity or heteroscedasticity.
    • The Q-Q plot indicates that the residuals approximately follow a normal distribution, with minor deviations at the tails.
    • The scale-location plot (√|Standardized Residuals| vs. Fitted Values) shows that the variance is relatively constant, further supporting the assumption of homoscedasticity.
    • The residuals vs. leverage plot does not show influential points with high leverage, suggesting stability in the model.
  2. Cross-Validation Results:
    • The optimal lambda value for minimizing mean squared error (MSE) is 0.000201, which retains 24 nonzero predictors in the model.
    • The lambda chosen by the one-standard-error rule (λ.1se = 0.005715) selects a simpler model with 15 nonzero predictors, sacrificing minimal predictive performance for increased parsimony.
  3. Conclusion: The Lasso model achieves a balance between model complexity and predictive accuracy. It reduces the number of predictors to avoid overfitting while maintaining satisfactory performance metrics. These characteristics make it a robust choice for modeling, especially when interpretability and variable selection are priorities.
1. Residual Plots for Best PLS (Lasso) Model
# Generate diagnostic plots for the Lasso model
par(mfrow = c(2, 2))  # Set layout for 2x2 plots

# Residuals vs. Fitted
plot(
  as.numeric(lasso_test_predictions), 
  as.numeric(test_y - lasso_test_predictions),
  xlab = "Fitted Values",
  ylab = "Residuals",
  main = "Residuals vs Fitted",
  pch = 20,
  col = "black"
)
abline(h = 0, col = "red", lty = 2)

# Q-Q Plot
qqnorm(as.numeric(test_y - lasso_test_predictions),
       main = "Normal Q-Q Plot",
       pch = 20,
       col = "black")
qqline(as.numeric(test_y - lasso_test_predictions), col = "red")

# Scale-Location Plot
std_residuals <- scale(as.numeric(test_y - lasso_test_predictions))
plot(
  as.numeric(lasso_test_predictions),
  sqrt(abs(std_residuals)),
  xlab = "Fitted Values",
  ylab = "√|Standardized Residuals|",
  main = "Scale-Location",
  pch = 20,
  col = "black"
)
abline(h = 0, col = "red", lty = 2)

# Residuals vs Leverage Plot (approximation for test data)
hat_values_test <- diag(test_X %*% solve(t(train_X) %*% train_X) %*% t(test_X))  # Approx. Hat matrix for test
plot(
  hat_values_test,
  std_residuals,
  xlab = "Leverage",
  ylab = "Standardized Residuals",
  main = "Residuals vs Leverage",
  pch = 20,
  col = "black"
)
abline(h = 0, col = "red", lty = 2)

par(mfrow = c(1, 1))  # Reset plotting layout

# Print summary of the Lasso model
print(cv_lasso)
## 
## Call:  cv.glmnet(x = train_X, y = train_y, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.000201    65 0.01792 0.000795      24
## 1se 0.005715    29 0.01866 0.000794      15
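
For reference, both candidate penalties and the sparser coefficient vector can be read directly off the cv.glmnet object; a minimal sketch:

# Sketch: inspect the two candidate penalties and the lambda.1se coefficients
cv_lasso$lambda.min                # lambda minimizing CV mean-squared error
cv_lasso$lambda.1se                # largest lambda within one SE of the minimum
coef(cv_lasso, s = "lambda.1se")   # sparse coefficient vector at the 1-SE penalty
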
2. Variable Importance for Best PLS (Lasso) Model

The Lasso Variable Importance plot shows that Fill.Pressure has the highest importance, followed by Carb.Pressure, Oxygen.Filler, and PC.Volume, while variables like Pressure.Vacuum and Carb.Flow have minimal or no importance in the model.

# Extract coefficients for Lasso model
lasso_coefficients <- coef(cv_lasso, s = lasso_lambda)

# Create a data frame for variable importance (absolute coefficients)
importance_df_lasso <- data.frame(
  Variable = rownames(lasso_coefficients),
  Importance = abs(as.numeric(lasso_coefficients))
)

# Remove intercept from the importance calculation
importance_df_lasso <- importance_df_lasso[importance_df_lasso$Variable != "(Intercept)", ]

# Sort by importance
importance_df_lasso <- importance_df_lasso[order(-importance_df_lasso$Importance), ]

# Plot variable importance for Lasso
library(ggplot2)
ggplot(importance_df_lasso, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_point() +
  coord_flip() +
  labs(title = "Lasso Variable Importance", x = "Variables", y = "Importance") +
  theme_minimal()

3. Cross-Validation for Lasso and Elastic Net

The Cross-Validation Profile for Lasso Regression shows that the mean-squared error (MSE) is minimized at a small value of λ (lambda), toward the left of the log(λ) axis; this identifies the optimal regularization parameter that balances model complexity and error, with the error curve rising as the penalty grows and more coefficients are shrunk to zero.

# Lasso Regression
cv_lasso <- cv.glmnet(train_X, train_y, alpha = 1)

# Elastic Net
cv_enet <- train(
  x = train_X, y = train_y,
  method = "glmnet",
  tuneGrid = expand.grid(
    alpha = seq(0, 1, by = 0.1),
    lambda = seq(0.001, 0.1, length.out = 10)
  ),
  trControl = trainControl(method = "cv")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Extract Best Model and Plot
plot(cv_lasso, main = "Cross-Validation Profile for Lasso Regression")

Non-Linear Regression Models

Modeling Data and Function

The code prepares data, splits it into training and testing sets, and defines a function to compute AIC and BIC for a given model based on its predictions, residuals, and parameters.
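
The information criteria produced by this helper assume Gaussian residuals, so the log-likelihood and criteria reduce to the closed forms below (with \(n\) test observations, \(k\) model parameters, and \(\hat{\sigma}^2\) the mean squared residual):

\[
\log L = -\frac{n}{2}\log\!\left(2\pi\hat{\sigma}^2\right) - \frac{n}{2},
\qquad
\mathrm{AIC} = -2\log L + 2k,
\qquad
\mathrm{BIC} = -2\log L + k\log n
\]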

# Install required packages
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("earth", quietly = TRUE)) {
  install.packages("earth")
}
if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071")
}
if (!requireNamespace("nnet", quietly = TRUE)) {
  install.packages("nnet")
}

# Load libraries
library(caret)
library(earth)  # For MARS
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(e1071)  # For SVR
library(nnet)   # For ANN

# Set seed for reproducibility
set.seed(123)

# Data preparation
# Use the standardized statistical_models_data prepared earlier
data <- statistical_models_data
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH

# Function to compute AIC and BIC
compute_aic_bic <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)
  k <- length(coef(model))
  mse <- mean(residuals^2)
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2
  aic <- -2 * loglik + 2 * k
  bic <- -2 * loglik + log(n) * k
  return(c(AIC = aic, BIC = bic))
}

KNN Model

The optimal K for the KNN model is 5, achieving a Test RMSE of 0.1287 and a Test R² of 0.4893. The RMSE profile shows higher cross-validated error for larger values of K, confirming this as the best configuration among those tried. AIC and BIC are both -647.91 (identical here because the helper could not extract a parameter count for KNN, so the complexity penalty contributes nothing).
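
For context, KNN regression predicts at a point by averaging the response of its \(k\) nearest training neighbors (the unweighted averaging used by caret's knn method):

\[
\hat{y}(\mathbf{x}) = \frac{1}{k}\sum_{i \in N_k(\mathbf{x})} y_i
\]

where \(N_k(\mathbf{x})\) indexes the \(k\) training observations closest to \(\mathbf{x}\) in predictor space.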

# Load necessary libraries
library(ggplot2)

### 1. K-Nearest Neighbors (KNN)
knn_model <- train(
  x = train_X, y = train_y,
  method = "knn",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10)  # Removed preProc
)

# Generate predictions on the test set
knn_predictions <- predict(knn_model, test_X)

# Calculate RMSE
knn_rmse <- RMSE(knn_predictions, test_y)

# Calculate R-squared
knn_r2 <- cor(knn_predictions, test_y)^2

# Compute AIC and BIC
knn_aic_bic <- compute_aic_bic(knn_model, test_X, test_y)

# Extract the optimal value of K
optimal_k <- knn_model$bestTune$k

# Print the results
cat("KNN Model Results:\n")
## KNN Model Results:
cat("Optimal K:", optimal_k, "\n")
## Optimal K: 5
cat("Test RMSE:", knn_rmse, "\n")
## Test RMSE: 0.1286795
cat("Test R^2:", knn_r2, "\n")
## Test R^2: 0.4892592
cat("AIC:", knn_aic_bic["AIC"], "\n")
## AIC: -647.9108
cat("BIC:", knn_aic_bic["BIC"], "\n")
## BIC: -647.9108
# Extract the results from the KNN model
knn_results_df <- knn_model$results

# Create a plot of RMSE vs k for the KNN model
ggplot(knn_results_df, aes(x = k, y = RMSE)) +
  geom_line() +
  geom_point(size = 2, color = "blue") +
  labs(
    title = "RMSE Profiles for the K-Nearest Neighbors Model",
    x = "Number of Neighbors (k)",
    y = "RMSE (Cross-Validation)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

### Residual analysis of the KNN Model

This residual analysis evaluates the performance and assumptions of the KNN model by examining residual patterns and prediction accuracy. It includes:

  • Residuals vs. Predicted Values: Checks for systematic bias in predictions.
  • Residual Distribution: Assesses the normality of residuals, a key assumption for model evaluation.
  • Predicted vs. Actual Values: Highlights how closely predictions match actual outcomes.
  • QQ Plot: Examines residual normality visually, detecting deviations from the expected distribution.

The combined analysis provides insights into the KNN model’s predictive reliability and adherence to assumptions.

# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
  install.packages("ggplotify")
}

# Load necessary libraries
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Prepare the plots

# 1. Residuals vs. Predicted Values
knn_residuals <- test_y - knn_predictions
plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = knn_predictions, Residuals = knn_residuals), 
                                      aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Predicted Values",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme_minimal()

# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = knn_residuals), aes(x = Residuals)) +
  geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Residuals",
    x = "Residuals",
    y = "Frequency"
  ) +
  theme_minimal()

# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = knn_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "black", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
  labs(
    title = "KNN Predictions vs. Actual Values",
    x = "Actual Values",
    y = "Predicted Values"
  ) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display the QQ Plot
qq_plot <- function() {
  qqnorm(knn_residuals, main = "QQ Plot of Residuals")
  qqline(knn_residuals, col = "red")
}

# Arrange the plots in a grid
grid.arrange(
  plot_residuals_vs_predicted, 
  plot_residuals_hist, 
  plot_predicted_vs_actual,
  ggplotify::as.ggplot(qq_plot), # Convert the base R plot into a ggplot-compatible object
  ncol = 2
)

SVM Model

The Support Vector Regression (SVR) model demonstrated the best performance with the following optimal parameters:

  • Sigma (σ): 0.02667853
  • Cost (C): 4

The Test RMSE achieved was 0.1217, and the Test \(R^2\) was 0.5382, indicating strong predictive performance. The cross-validation curve shows the RMSE decreasing as \(C\) approaches the optimal value and leveling off or rising beyond it, supporting the selected configuration. The AIC, AICc, and BIC values reported below use the number of support vectors as a proxy for the parameter count, so they are best read as rough complexity indicators rather than strict fit statistics.
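
For context, caret's svmRadial method fits a support vector machine with kernlab's radial basis (Gaussian) kernel, so the tuned \(\sigma\) controls the kernel width while \(C\) penalizes margin violations:

\[
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\sigma\,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right)
\]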

# Load necessary libraries
library(ggplot2)

### 2. Support Vector Regression (SVR)
svr_model <- train(
  x = train_X, y = train_y,
  method = "svmRadial",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10)  # Removed preProc
)

# Generate predictions on the test set
svr_predictions <- predict(svr_model, test_X)

# Calculate performance metrics
svr_rmse <- RMSE(svr_predictions, test_y)
svr_r2 <- cor(svr_predictions, test_y)^2

# AIC/BIC computation function for SVR
compute_aic_bic_svr <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  mse <- mean(residuals^2)  # Mean squared error
  
  # Use the number of support vectors as the number of parameters (proxy)
  k <- length(model$finalModel@alpha)  # Number of support vectors
  
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

# Compute AIC, AICc, and BIC for the SVR model
svr_aic_bic <- compute_aic_bic_svr(svr_model, test_X, test_y)

# Extract optimal parameters from the model
optimal_params <- svr_model$bestTune  # Best hyperparameters for SVR

# Display SVR results
cat("SVR Results:\n")
## SVR Results:
cat("Optimal Parameters:\n")
## Optimal Parameters:
print(optimal_params)
##        sigma C
## 5 0.02667853 4
cat("Test RMSE:", svr_rmse, "\n")
## Test RMSE: 0.1216659
cat("Test R^2:", svr_r2, "\n")
## Test R^2: 0.53815
cat("AIC:", svr_aic_bic["AIC"], "\n")
## AIC: 2794.586
cat("AICc:", svr_aic_bic["AICc"], "\n")
## AICc: -2155.737
cat("BIC:", svr_aic_bic["BIC"], "\n")
## BIC: 10215.07
# Extract the results from the SVR model
results_df <- svr_model$results

# Create a plot of RMSE vs Cost (C) for the SVR model
ggplot(results_df, aes(x = C, y = RMSE, color = as.factor(sigma))) +
  geom_line() +
  geom_point(size = 2) +
  labs(
    title = "RMSE Profiles for the Support Vector Regression Model",
    x = "Cost (C)",
    y = "RMSE (Cross-Validation)",
    color = "Sigma (σ)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

Residual Analysis of SVR Model

Residual analysis evaluates the performance and assumptions of a fitted Support Vector Regression (SVR) model. Below are key observations based on the generated plots:

  • Residuals vs. Predicted Values: The scatterplot shows residuals are relatively scattered around zero, indicating reasonable model fit. However, slight patterns suggest potential non-linear relationships or missing interactions.
  • Residual Distribution: The histogram reveals residuals are centered around zero but exhibit slight skewness, indicating the model’s error terms are not perfectly normally distributed.
  • QQ Plot: The QQ plot suggests minor deviations from normality in the residuals, particularly at the extremes, which could indicate outliers or heteroscedasticity.
  • Predicted vs. Actual Values: The scatterplot demonstrates predicted values closely align with actual values, reflected by points clustering near the diagonal, confirming good overall model performance.

These insights highlight areas of improvement for fine-tuning the SVR model or feature engineering.

# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
  install.packages("ggplotify")
}

# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Calculate residuals
svr_residuals <- test_y - svr_predictions

plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = svr_predictions, Residuals = svr_residuals), 
                                      aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Predicted Values",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme_minimal()

# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = svr_residuals), aes(x = Residuals)) +
  geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Residuals",
    x = "Residuals",
    y = "Frequency"
  ) +
  theme_minimal()

# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = svr_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "black", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
  labs(
    title = "SVR Predictions vs. Actual Values",
    x = "Actual Values",
    y = "Predicted Values"
  ) +
  theme_minimal()

# Display the QQ Plot
qq_plot <- function() {
  qqnorm(svr_residuals, main = "QQ Plot of Residuals")
  qqline(svr_residuals, col = "red")
}

# Arrange the plots in a grid
grid.arrange(
  plot_residuals_vs_predicted, 
  plot_residuals_hist, 
  plot_predicted_vs_actual,
  ggplotify::as.ggplot(qq_plot), 
  ncol = 2
)

ANN Model

The optimal number of hidden units for the ANN is 9 (at the fixed decay of 0.0178), minimizing cross-validated RMSE. The selected model achieved:

  • Train RMSE: 0.1249, Test RMSE: 0.1276
  • Train R²: 0.481, Test R²: 0.497
  • AIC: -186.20, AICc: 214.23, BIC: 810.26

This configuration balances accuracy and generalization effectively.
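
For context, nnet fits a single-hidden-layer network; with linout = TRUE the prediction is a linear combination of sigmoidal hidden units, and the decay parameter adds a quadratic penalty on the network weights to the error being minimized (a sketch of the functional form):

\[
\hat{y}(\mathbf{x}) = b_0 + \sum_{h=1}^{H} w_h\,\phi\!\left(b_h + \mathbf{v}_h^{\top}\mathbf{x}\right),
\qquad \phi(z) = \frac{1}{1 + e^{-z}}
\]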

### Artificial Neural Network (ANN) Model with Validation Curve
library(caret)
library(nnet)
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Define a grid for number of hidden units (size) using the optimal decay = 0.01778279
validation_grid <- expand.grid(size = seq(2, 18, by = 1), decay = 0.01778279)

# Train ANN model with the validation grid
ann_model <- train(
  x = train_X, y = train_y,
  method = "nnet",
  tuneGrid = validation_grid,  # Test varying hidden units
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  linout = TRUE,  # Linear output for regression tasks
  trace = FALSE
)

# Generate predictions on the test set
ann_predictions <- predict(ann_model, test_X)

# Calculate performance metrics
ann_rmse <- RMSE(ann_predictions, test_y)
ann_r2 <- cor(ann_predictions, test_y)^2

# Compute AIC, AICc, and BIC for ANN
compute_aic_bic <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  k <- length(coef(model$finalModel))  # Number of parameters in the final model
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

# Compute AIC, AICc, and BIC
ann_aic_bic <- compute_aic_bic(ann_model, test_X, test_y)

# Extract the optimal configuration from ann_model$results
optimal_config <- ann_model$results[which.min(ann_model$results$RMSE), ]

# Extract the optimal number of hidden units
best_hidden_units <- optimal_config$size



# Display results
cat("ANN Results:\n")
## ANN Results:
cat("Optimal Number of Hidden Units (Programmatically Extracted):", best_hidden_units, "\n")
## Optimal Number of Hidden Units (Programmatically Extracted): 9
cat("Train RMSE:", ann_model$results$RMSE[which.min(ann_model$results$RMSE)], "\n")
## Train RMSE: 0.1248814
cat("Test RMSE:", ann_rmse, "\n")
## Test RMSE: 0.1276438
cat("Train R^2:", ann_model$results$Rsquared[which.min(ann_model$results$RMSE)], "\n")
## Train R^2: 0.4810389
cat("Test R^2:", ann_r2, "\n")
## Test R^2: 0.4972056
cat("AIC:", ann_aic_bic["AIC"], "\n")
## AIC: -186.2019
cat("AICc:", ann_aic_bic["AICc"], "\n")
## AICc: 214.2313
cat("BIC:", ann_aic_bic["BIC"], "\n")
## BIC: 810.263
# Extract the results from the model
results_df <- ann_model$results

# Plot RMSE vs. Number of Hidden Units
ggplot(results_df, aes(x = size, y = RMSE)) +
  geom_line(color = "blue") +
  geom_point(size = 2, color = "red") +
  labs(
    title = "Validation Curve: RMSE vs. # Hidden Units (ANN)",
    x = "# Hidden Units",
    y = "RMSE (Cross-Validation)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Residual Analysis of ANN Model

Residual analysis evaluates the performance and assumptions of the fitted Artificial Neural Network (ANN) model. Below are key observations based on the generated plots:

  • Residuals vs. Predicted Values: The scatterplot shows residuals are relatively scattered around zero, indicating reasonable model fit. However, slight patterns suggest potential non-linear relationships or missing interactions.
  • Residual Distribution: The histogram reveals residuals are centered around zero but exhibit slight skewness, indicating the model’s error terms are not perfectly normally distributed.
  • QQ Plot: The QQ plot suggests minor deviations from normality in the residuals, particularly at the extremes, which could indicate outliers or heteroscedasticity.
  • Predicted vs. Actual Values: The scatterplot demonstrates predicted values closely align with actual values, reflected by points clustering near the diagonal, confirming good overall model performance.

These insights highlight areas of improvement for fine-tuning the ANN model or for further feature engineering.

# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
  install.packages("ggplotify")
}

# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Prepare the plots

# 1. Residuals vs. Predicted Values
ann_residuals <- test_y - ann_predictions

plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = ann_predictions, Residuals = ann_residuals), 
                                      aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. Predicted Values",
    x = "Predicted Values",
    y = "Residuals"
  ) +
  theme_minimal()

# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = ann_residuals), aes(x = Residuals)) +
  geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Residuals",
    x = "Residuals",
    y = "Frequency"
  ) +
  theme_minimal()

# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = ann_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "black", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
  labs(
    title = "ANN Predictions vs. Actual Values",
    x = "Actual Values",
    y = "Predicted Values"
  ) +
  theme_minimal()

# Display the QQ Plot
qq_plot <- function() {
  qqnorm(ann_residuals, main = "QQ Plot of Residuals")
  qqline(ann_residuals, col = "red")
}

# Arrange the plots in a grid
grid.arrange(
  plot_residuals_vs_predicted, 
  plot_residuals_hist, 
  plot_predicted_vs_actual,
  ggplotify::as.ggplot(qq_plot), 
  ncol = 2
)

MARS Model

The MARS model shows moderate predictive performance:

  • The optimal number of basis functions is 18, balancing model complexity and accuracy.
  • The Test RMSE (0.1327) is close to the Train RMSE (0.1288), indicating little overfitting.
  • The R² values (Train: 0.4347, Test: 0.4501) suggest the model explains about 45% of the variance in the test set.
  • The AIC (-584.07) and BIC (-516.23) indicate a well-fitted model with good generalization ability.
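
For context, MARS builds its basis from hinge functions of the predictors, so the fitted surface is a weighted sum of terms such as \(\max(0, x_j - c)\) and \(\max(0, c - x_j)\) (and, when the degree exceeds 1, products of such hinges):

\[
\hat{y}(\mathbf{x}) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(\mathbf{x}),
\qquad
B_m(\mathbf{x}) \in \left\{\max(0, x_j - c),\; \max(0, c - x_j),\; \text{products of hinges}\right\}
\]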

# Load necessary libraries
library(ggplot2)

### 4. Multivariate Adaptive Regression Splines (MARS)
# Assuming tree_based_data is used for MARS
data <- tree_based_data  # Switch to untransformed data for MARS
# Reuse the train/test split created above (train_indices) so the MARS test
# rows match those used for KNN, SVR, and ANN
mars_train_data <- data[train_indices, ]
mars_test_data <- data[-train_indices, ]
mars_train_X <- mars_train_data[, setdiff(names(mars_train_data), "PH")]
mars_train_y <- mars_train_data$PH
mars_test_X <- mars_test_data[, setdiff(names(mars_test_data), "PH")]
mars_test_y <- mars_test_data$PH




mars_model <- train(
  x = mars_train_X, y = mars_train_y,
  method = "earth",
  tuneLength = 10,
  trControl = trainControl(method = "cv", number = 10)  
)

# Generate predictions on the test set
mars_predictions <- predict(mars_model, mars_test_X)

# Calculate performance metrics
mars_rmse <- RMSE(mars_predictions, mars_test_y)
mars_r2 <- cor(mars_predictions, mars_test_y)^2

# Compute AIC and BIC for MARS
mars_aic_bic <- compute_aic_bic(mars_model, mars_test_X, mars_test_y)

# Extract the optimal configuration from mars_model$results
optimal_config <- mars_model$results[which.min(mars_model$results$RMSE), ]

# Extract the optimal number of basis functions (if applicable)
best_basis_functions <- optimal_config$nprune

# Display results
cat("MARS Model Results:\n")
## MARS Model Results:
cat("Optimal Number of Basis Functions:", best_basis_functions, "\n")
## Optimal Number of Basis Functions: 18
cat("Train RMSE:", mars_model$results$RMSE[which.min(mars_model$results$RMSE)], "\n")
## Train RMSE: 0.1288138
cat("Test RMSE:", mars_rmse, "\n")
## Test RMSE: 0.132735
cat("Train R^2:", mars_model$results$Rsquared[which.min(mars_model$results$RMSE)], "\n")
## Train R^2: 0.4347175
cat("Test R^2:", mars_r2, "\n")
## Test R^2: 0.4500877
cat("AIC:", mars_aic_bic["AIC"], "\n")
## AIC: -584.0745
cat("AICc:", mars_aic_bic["AICc"], "\n")
## AICc: -582.9777
cat("BIC:", mars_aic_bic["BIC"], "\n")
## BIC: -516.2301
# Extract the results from the MARS model
mars_results_df <- mars_model$results


# Create a plot of RMSE vs. nprune for each degree
ggplot(mars_results_df, aes(x = nprune, y = RMSE, color = factor(degree))) +
  geom_line() +
  geom_point(size = 2) +
  labs(
    title = "RMSE Profiles for the MARS Model",
    x = "Number of Pruned Terms (nprune)",
    y = "RMSE (Cross-Validation)",
    color = "Degree"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Non-linear Model Performance Summary

The Summary of Non-linear Models Performance highlights the following key observations:

  1. Support Vector Regression (SVR) achieved the best performance with the lowest Test RMSE (0.1217) and the highest Test R² (0.5382), despite its high AIC and BIC.
  2. Artificial Neural Network (ANN) followed closely with a Test RMSE of 0.1276 and a Test R² of 0.4972, showing a good balance of performance and complexity.
  3. K-Nearest Neighbors (KNN) and MARS exhibited slightly higher Test RMSE values (0.1287 and 0.1327, respectively), with lower R² values.
  4. Among these models, MARS offers the most favorable tradeoff between performance and complexity, with low AIC and BIC; the even lower (and identical) KNN values should be read with caution, since the helper could not assign KNN a meaningful parameter count.

Overall, SVR emerges as the most accurate model, while MARS is the most efficient in terms of model simplicity.

# Combine results into a data frame
results_df <- data.frame(
  Model = c("KNN", "MARS", "SVR", "ANN"),
  Train_RMSE = c(knn_model$results$RMSE[which.min(knn_model$results$RMSE)],
                 mars_model$results$RMSE[which.min(mars_model$results$RMSE)],
                 svr_model$results$RMSE[which.min(svr_model$results$RMSE)],
                 ann_model$results$RMSE[which.min(ann_model$results$RMSE)]
                 ),
  Test_RMSE = c(knn_rmse, mars_rmse, svr_rmse, ann_rmse),
  Train_R2 = c(knn_model$results$Rsquared[which.min(knn_model$results$RMSE)],
               mars_model$results$Rsquared[which.min(mars_model$results$RMSE)],
               svr_model$results$Rsquared[which.min(svr_model$results$RMSE)],
               ann_model$results$Rsquared[which.min(ann_model$results$RMSE)]
               ),
  Test_R2 = c(knn_r2, mars_r2, svr_r2, ann_r2),
  AIC = c(knn_aic_bic["AIC"], mars_aic_bic["AIC"], svr_aic_bic["AIC"], ann_aic_bic["AIC"]),
  BIC = c(knn_aic_bic["BIC"], mars_aic_bic["BIC"], svr_aic_bic["BIC"], ann_aic_bic["BIC"])
)

# Sort by Test_RMSE
results_df <- results_df[order(results_df$Test_RMSE), ]

# Display results
results_df %>%
  kable(
    caption = "Summary of Non-linear Models Performance"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE  # Set to FALSE for custom widths
  ) %>%
  column_spec(1, width = "5cm") %>%  # Adjust the width of the first column
  column_spec(2:5, width = "3cm")  # Adjust the width of columns 2 to 5
Summary of Non-linear Models Performance

Model   Train_RMSE   Test_RMSE   Train_R2    Test_R2     AIC         BIC
SVR     0.1184443    0.1216659   0.5242925   0.5381500   2794.5859   10215.0686
ANN     0.1248814    0.1276438   0.4810389   0.4972056   -186.2019   810.2630
KNN     0.1258932    0.1286795   0.4680361   0.4892592   -647.9108   -647.9108
MARS    0.1288138    0.1327350   0.4347175   0.4500877   -584.0745   -516.2301

Regression Tree Models

Prepare Modeling Data

# Install required packages
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("ipred", quietly = TRUE)) {
  install.packages("ipred")
}
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}
if (!requireNamespace("gbm", quietly = TRUE)) {
  install.packages("gbm")
}
if (!requireNamespace("xgboost", quietly = TRUE)) {
  install.packages("xgboost")
}
# Install and load doParallel
if (!requireNamespace("doParallel", quietly = TRUE)) {
  install.packages("doParallel")
}

library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
# Load libraries
library(caret)         # For model training and cross-validation
library(ipred)         # For Bagged Trees
library(randomForest)  # For Random Forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(gbm)           # For Gradient Boosting Machine
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(xgboost)       # For Extreme Gradient Boosting
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
# Set seed for reproducibility
set.seed(123)

# Data preparation
# Using tree_based_data for modeling
data <- tree_based_data

# Split the data into training and testing sets (80% training, 20% testing)
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Separate predictors (X) and response (y)
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH

# Function to compute AIC, BIC, and AICc
compute_aic_bic <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Sample size
  k <- length(coef(model))  # Number of parameters in the model
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  
  # Compute AIC
  aic <- -2 * loglik + 2 * k
  
  # Compute BIC
  bic <- -2 * loglik + log(n) * k
  
  # Compute AICc (corrected AIC)
  if (n - k - 1 > 0) {
    aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
  } else {
    aicc <- NA  # AICc not defined if n - k - 1 <= 0
  }
  
  return(c(AIC = aic, BIC = bic, AICc = aicc))
}

Bagged Trees Model

The Bagged Trees Model achieved optimal performance with the following configuration:
- Min Split: 2
- Max Depth: 10
- Number of Bootstrap Trees: 20

Performance Metrics:
- RMSE: 0.1248
- R²: 0.5216
- AIC: -630.85
- BIC: -546.04

This configuration balances model complexity and predictive performance, with the lowest RMSE and solid generalization (R² ≈ 52%).
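
For context, a bagged-trees prediction is the average of the predictions from the \(B\) bootstrap-trained trees, which is the source of the variance reduction relative to a single deep tree:

\[
\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(\mathbf{x})
\]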

#### Bagged Trees Model
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}
library(ipred)
library(rpart)
library(ggplot2)

# Define a grid of parameters to tune
tune_grid <- expand.grid(
  minsplit = c(2, 5, 10),  # Minimum samples required for a split
  maxdepth = c(5, 10, 15), # Maximum depth of trees
  nbagg = c(10, 20, 30)    # Number of bootstrap trees
)

# Placeholder for results
results <- data.frame()

# Loop through parameter combinations
for (i in 1:nrow(tune_grid)) {
  set.seed(123)
  
  # Extract current parameter values
  params <- tune_grid[i, ]
  
  # Train Bagged Trees with specified parameters
  bagged_model <- bagging(
    train_y ~ ., data = data.frame(train_X, train_y),
    nbagg = params$nbagg,
    control = rpart.control(minsplit = params$minsplit, maxdepth = params$maxdepth)
  )
  
  # Predict on the test set
  predictions <- predict(bagged_model, newdata = test_X)
  bagged_rmse <- RMSE(predictions, test_y)
  bagged_r2 <- cor(predictions, test_y)^2
  
  # Store results
  results <- rbind(
    results,
    data.frame(
      minsplit = params$minsplit,
      maxdepth = params$maxdepth,
      nbagg = params$nbagg,
      RMSE = bagged_rmse,
      R2 = bagged_r2
    )
  )
}

# Identify the best configuration
best_config <- results[which.min(results$RMSE), ]
cat("Best Configuration:\n")
## Best Configuration:
print(best_config)
##    minsplit maxdepth nbagg      RMSE        R2
## 13        2       10    20 0.1248368 0.5216001
# Compute AIC and BIC for the best bagged model
compute_aic_bic_bagged <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  k <- length(model$mtrees)  # Proxy for number of parameters (number of trees)
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

# Train the best bagged model based on the optimal configuration
final_bagged_model <- bagging(
  train_y ~ ., data = data.frame(train_X, train_y),
  nbagg = best_config$nbagg,
  control = rpart.control(minsplit = best_config$minsplit, maxdepth = best_config$maxdepth)
)

# Compute AIC and BIC for the final model
bagged_aic_bic <- compute_aic_bic_bagged(final_bagged_model, test_X, test_y)

# Display the best configuration and metrics
cat("\nOptimal Bagged Trees Model Configuration:\n")
## 
## Optimal Bagged Trees Model Configuration:
cat("Min Split:", best_config$minsplit, "\n")
## Min Split: 2
cat("Max Depth:", best_config$maxdepth, "\n")
## Max Depth: 10
cat("Number of Bootstrap Trees:", best_config$nbagg, "\n")
## Number of Bootstrap Trees: 20
cat("Optimal RMSE:", best_config$RMSE, "\n")
## Optimal RMSE: 0.1248368
cat("Optimal R2:", best_config$R2, "\n")
## Optimal R2: 0.5216001
cat("AIC:", bagged_aic_bic["AIC"], "\n")
## AIC: -630.8485
cat("AICc:", bagged_aic_bic["AICc"], "\n")
## AICc: -629.1411
cat("BIC:", bagged_aic_bic["BIC"], "\n")
## BIC: -546.0429
# Enhanced plot with proper grouping, facetting, and distinct colors
ggplot(results, aes(
  x = nbagg, 
  y = RMSE, 
  group = interaction(minsplit, maxdepth),  # Group by minsplit and maxdepth
  color = as.factor(minsplit)  # Apply color to minsplit
)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ maxdepth, labeller = label_both) +
  scale_color_manual(
    values = c("2" = "red", "5" = "green", "10" = "blue")  # Manually assign colors
  ) +
  labs(
    title = "RMSE Profiles for Tuned Bagged Trees Model",
    x = "Number of Bootstrap Trees (nbagg)",
    y = "RMSE (Cross-Validation)",
    color = "Min Split"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

Random Forest Model

The Random Forest Model achieves strong predictive performance with the following results:
- Optimal mtry (Number of Predictors Sampled at Each Split): 22
- Test RMSE: 0.1000, the lowest among the models tested
- Test R²: 0.7011, explaining about 70% of the variance in the test set

The RMSE profile shows that cross-validated RMSE decreases as mtry increases and stabilizes around mtry = 22, indicating that the tuning has converged. The Random Forest model outperforms the other models, making it the most accurate.

### Random Forest Model
library(randomForest)
library(doParallel)
library(ggplot2)

# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

set.seed(123)

# Train Random Forest model
rf_model <- train(
  x = train_X, y = train_y,
  method = "rf",
  tuneLength = 10,  # Automatically tests 10 different mtry values
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
)

# Generate predictions
rf_predictions <- predict(rf_model, test_X)
rf_rmse <- RMSE(rf_predictions, test_y)
rf_r2 <- cor(rf_predictions, test_y)^2

# Extract optimal parameters
optimal_params_rf <- rf_model$bestTune

# Compute AIC and BIC for Random Forest
compute_aic_bic_rf <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  k <- model$finalModel$ntree  # Use the number of trees as a proxy for model complexity
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

rf_aic_bic <- compute_aic_bic_rf(rf_model, test_X, test_y)

# Stop parallel processing
stopCluster(cl)
registerDoSEQ()

# Plot RMSE vs. mtry (Number of Features Tried at Each Split)
rf_results <- rf_model$results
ggplot(rf_results, aes(x = mtry, y = RMSE)) +
  geom_line(color = "blue") +
  geom_point(size = 2) +
  labs(
    title = "RMSE Profiles for the Random Forest Model",
    x = "mtry (Number of Features Tried at Each Split)",
    y = "RMSE (Cross-Validation)"
  ) +
  theme_minimal()

# Print Results
cat("Random Forest Results:\n")
## Random Forest Results:
cat("Optimal Parameters (mtry):", optimal_params_rf$mtry, "\n")
## Optimal Parameters (mtry): 22
cat("Test RMSE:", rf_rmse, "\n")
## Test RMSE: 0.1000187
cat("Test R^2:", rf_r2, "\n")
## Test R^2: 0.701144
cat("AIC:", rf_aic_bic["AIC"], "\n")
## AIC: 93.57
cat("AICc:", rf_aic_bic["AICc"], "\n")
## AICc: 41843.57
cat("BIC:", rf_aic_bic["BIC"], "\n")
## BIC: 2213.708

Boosting (GBM) Model

The Boosting (GBM) Model achieves strong performance with the following configuration:
- Optimal Parameters:
  - Number of Trees (n.trees): 450
  - Interaction Depth: 10
  - Shrinkage: 0.1
  - Minimum Observations per Node (n.minobsinnode): 10
- Test RMSE: 0.1075
- Test R²: 0.6381

The RMSE profile shows that performance improves as the number of trees increases, stabilizing around 450 trees. The Boosting model delivers solid accuracy, with good generalization and competitive results.

### Boosting (GBM) Model
library(gbm)
library(doParallel)
library(ggplot2)

# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

set.seed(123)

# Train the GBM model
boost_model <- train(
  x = train_X, y = train_y,
  method = "gbm",
  tuneLength = 10,  # Automatically tunes over 10 combinations of parameters
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
  verbose = FALSE  # Suppress training output
)

# Generate predictions on the test set
boost_predictions <- predict(boost_model, test_X)
boost_rmse <- RMSE(boost_predictions, test_y)
boost_r2 <- cor(boost_predictions, test_y)^2

# Extract optimal parameters from the GBM model
optimal_params_boost <- boost_model$bestTune

# Compute AIC and BIC for Boosting (GBM)
compute_aic_bic_gbm <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  k <- model$finalModel$n.trees  # Use number of trees as a proxy for model complexity
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

boost_aic_bic <- compute_aic_bic_gbm(boost_model, test_X, test_y)

# Stop parallel processing
stopCluster(cl)
registerDoSEQ()

# Plot RMSE vs. Number of Trees, Shrinkage, and Depth
boost_results <- boost_model$results
ggplot(boost_results, aes(x = n.trees, y = RMSE, color = as.factor(interaction(shrinkage, interaction.depth)))) +
  geom_line() +
  geom_point(size = 2) +
  labs(
    title = "RMSE Profiles for the Boosting (GBM) Model",
    x = "Number of Trees",
    y = "RMSE (Cross-Validation)",
    color = "Shrinkage/Depth"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Display results
cat("Boosting (GBM) Results:\n")
## Boosting (GBM) Results:
cat("Optimal Parameters:\n")
## Optimal Parameters:
print(optimal_params_boost)
##    n.trees interaction.depth shrinkage n.minobsinnode
## 99     450                10       0.1             10
cat("Test RMSE:", boost_rmse, "\n")
## Test RMSE: 0.1075398
cat("Test R^2:", boost_r2, "\n")
## Test R^2: 0.6380685
cat("AIC:", boost_aic_bic["AIC"], "\n")
## AIC: 67.9596
cat("AICc:", boost_aic_bic["AICc"], "\n")
## AICc: 6614.734
cat("BIC:", boost_aic_bic["BIC"], "\n")
## BIC: 1976.084

XGBoost Model

The Refined XGBoost Model achieves the following performance:
- Optimal Parameters:
  - Boosting Rounds (nrounds): 200
  - Max Depth: 9
  - Learning Rate (eta): 0.1
- Test RMSE: 0.1099
- Test R²: 0.6224, explaining approximately 62% of the variance in the test set.

The RMSE profile shows significant improvement as the number of boosting rounds increases, stabilizing around 200 rounds. The model demonstrates strong performance with effective regularization and optimal hyperparameters.

### XGBoost Search for Boosting Rounds
library(xgboost)
library(caret)
library(doParallel)
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

# Data preparation
data <- gradient_based_data
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH

# Ensure train_X and test_X are data frames
if (is.list(train_X)) train_X <- as.data.frame(train_X)
if (is.list(test_X)) test_X <- as.data.frame(test_X)

# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Check for missing or infinite values
if (any(is.na(train_X)) || any(is.na(train_y)) || 
    any(is.infinite(as.matrix(train_X))) || any(is.infinite(train_y))) {
  stop("Training data contains missing or infinite values.")
}

# Refined search with fixed optimal max_depth and eta from earlier run
optimal_max_depth <- 9 
optimal_eta <- 0.1      

# Set up a granular grid for nrounds
refined_tune_grid <- expand.grid(
  nrounds = seq(50, 200, by = 25),  # More granular values for nrounds
  max_depth = optimal_max_depth,
  eta = optimal_eta,
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.8
)

# Train the model with the refined grid
xgb_refined_model <- tryCatch({
  train(
    x = train_X, y = train_y,
    method = "xgbTree",
    tuneGrid = refined_tune_grid,
    trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE, verboseIter = TRUE)
  )
}, error = function(e) {
  stop("Error during model training: ", e$message)
})
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 200, max_depth = 9, eta = 0.1, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8 on full training set
# Check if the model trained successfully
if (!exists("xgb_refined_model")) stop("XGBoost training failed.")

# Predictions and evaluation
xgb_refined_predictions <- predict(xgb_refined_model, test_X)
xgb_refined_rmse <- RMSE(xgb_refined_predictions, test_y)
xgb_refined_r2 <- cor(xgb_refined_predictions, test_y)^2

# Compute AIC and BIC for XGBoost
compute_aic_bic_xgb <- function(model, test_X, test_y) {
  predictions <- predict(model, newdata = test_X)
  residuals <- test_y - predictions
  n <- length(test_y)  # Number of observations
  k <- model$bestTune$nrounds  # Number of boosting rounds as proxy for parameters
  mse <- mean(residuals^2)  # Mean squared error
  loglik <- -n / 2 * log(2 * pi * mse) - n / 2  # Log-likelihood
  aic <- -2 * loglik + 2 * k  # AIC
  aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # Corrected AIC
  bic <- -2 * loglik + log(n) * k  # BIC
  return(c(AIC = aic, AICc = aicc, BIC = bic))
}

xgb_aic_bic <- compute_aic_bic_xgb(xgb_refined_model, test_X, test_y)

# Stop parallel processing
stopCluster(cl)
registerDoSEQ()

# Display refined results
cat("Refined XGBoost Results:\n")
## Refined XGBoost Results:
cat("Test RMSE:", xgb_refined_rmse, "\n")
## Test RMSE: 0.1099225
cat("Test R^2:", xgb_refined_r2, "\n")
## Test R^2: 0.6223548
cat("AIC:", xgb_aic_bic["AIC"], "\n")
## AIC: -409.5562
cat("AICc:", xgb_aic_bic["AICc"], "\n")
## AICc: -151.8639
cat("BIC:", xgb_aic_bic["BIC"], "\n")
## BIC: 438.499
# Plot RMSE vs. nrounds
results_df <- xgb_refined_model$results
if (nrow(results_df) > 0) {
  ggplot(results_df, aes(x = nrounds, y = RMSE)) +
    geom_line(color = "blue") +
    geom_point(size = 2, color = "red") +
    labs(
      title = "RMSE Profiles for XGBoost (Refined Search)",
      x = "Number of Boosting Rounds (nrounds)",
      y = "RMSE (Cross-Validation)"
    ) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))
} else {
  cat("No valid results available to plot.")
}

Performance Matrix for Tree Models

The Performance Matrix for Tree Models shows the following results:

  1. Random Forest achieved the best performance with the lowest Test RMSE (0.1000) and the highest Test R² (0.7011), making it the most accurate model.
  2. GBM follows closely with a Test RMSE of 0.1075 and Test R² of 0.6381, showing strong performance.
  3. XGBoost has a slightly higher Test RMSE (0.1099) and Test R² (0.6224), but it still performs well.
  4. Bagged Trees show the weakest performance with the highest Test RMSE (0.1252) and lowest Test R² (0.5187).

Overall, Random Forest is the top-performing model, providing the best balance between accuracy and generalization.
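
As an optional cross-check of the figures above, caret's postResample() can recompute RMSE and R² from the test-set predictions already in memory; this assumes rf_predictions and boost_predictions still correspond to the test split used to build the table (both splits were created with set.seed(123) on the same PH column).

# Optional cross-check of the table above using caret's postResample()
postResample(pred = rf_predictions, obs = test_y)     # RMSE, Rsquared, MAE for Random Forest
postResample(pred = boost_predictions, obs = test_y)  # RMSE, Rsquared, MAE for GBM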

# Combine results into a data frame
results_df <- data.frame(
  Model = c("Bagged Trees", "Random Forest", "GBM", "XGBoost"),
  Test_RMSE = c(bagged_rmse, rf_rmse, boost_rmse, xgb_refined_rmse),
  Test_R2 = c(bagged_r2, rf_r2, boost_r2, xgb_refined_r2),
  AIC = c(bagged_aic_bic["AIC"], rf_aic_bic["AIC"], boost_aic_bic["AIC"], xgb_aic_bic["AIC"]),
  BIC = c(bagged_aic_bic["BIC"], rf_aic_bic["BIC"], boost_aic_bic["BIC"], xgb_aic_bic["BIC"])
)

# Sort by Test_RMSE
results_df <- results_df[order(results_df$Test_RMSE), ]

# Display results
results_df %>%
  kable(
    caption = "Summary of Tree Models Performance"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE  # Set to FALSE for custom widths
  ) %>%
  column_spec(1, width = "5cm") %>%  # Adjust the width of the first column
  column_spec(2:5, width = "3cm")  # Adjust the width of columns 2 to 5
Summary of Tree Models Performance
Model Test_RMSE Test_R2 AIC BIC
2 Random Forest 0.1000187 0.7011440 93.5700 2213.7079
3 GBM 0.1075398 0.6380685 67.9596 1976.0837
4 XGBoost 0.1099225 0.6223548 -409.5562 438.4990
1 Bagged Trees 0.1251824 0.5186901 -630.8485 -546.0429

Final Model Selection

Best Model and Justification:

Based on the Consolidated Model Performance Metrics, the Random Forest model is the best overall performer among tree-based models. It achieves:

  • Highest Predictive Accuracy: Test_R² of 0.7011 and Test_RMSE of 0.1000, indicating a superior ability to explain variability and to predict accurately on unseen data.
  • Robustness: performance is consistent between cross-validation during tuning and the held-out test set, suggesting minimal overfitting.
  • Realistic Predictions: the Random Forest model predicts PH values that are highly consistent with expected ranges, making it well suited for practical use.

Why Choose Another Tree-Based Model?

Although Random Forest is the top choice, other tree-based models might be considered depending on the dataset or specific requirements:

  1. Flexibility:
    • XGBoost is a competitive alternative with a Test_RMSE of 0.1099 and Test_R² of 0.6224. It is particularly effective for noisy data or data with missing values, thanks to built-in regularization, native handling of missing values, and fine-grained hyperparameter control.
  2. Gradient-Based Boosting:
    • GBM provides solid performance with a Test_R² of 0.6381 and Test_RMSE of 0.1075, slightly better than XGBoost on this test set. It is simpler to configure and is often less computationally demanding for smaller datasets.
  3. Ensemble Stability:
    • Bagged Trees can be considered for datasets requiring high stability in predictions, although it underperforms relative to Random Forest, with a Test_RMSE of 0.1252 and Test_R² of 0.5187.

Why Exclude Statistical and Gradient-Based Models?

While the Interaction Term model and SVR perform well on test metrics, they fail to predict realistic PH values. These unrealistic predictions make them unsuitable for practical use cases, where interpretability and alignment with expected ranges are critical.
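
One simple way to operationalize this check is to compare each model's predicted range against the PH range observed in the training data. The sketch below is illustrative only: it assumes the fitted tree-based models and prediction vectors are still in memory, and the same pattern could be applied to the excluded models' predictions.

# Illustrative plausibility check: do predictions stay within the PH range seen in training?
ph_range <- range(train_y, na.rm = TRUE)
tree_preds <- list(RandomForest = rf_predictions, GBM = boost_predictions, XGBoost = xgb_refined_predictions)
sapply(tree_preds, function(p) c(min = min(p), max = max(p),
                                 n_outside = sum(p < ph_range[1] | p > ph_range[2])))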

Recommendation:

  • Primary Choice: Random Forest for its unmatched predictive performance, robustness, and realistic predictions.
  • Alternative Choices:
    • XGBoost for its robustness against noisy or incomplete data and ability to model complex relationships.
    • GBM for a balance between accuracy and computational efficiency in smaller datasets.

By focusing solely on tree-based models, this selection ensures the emphasis remains on methods that consistently deliver realistic predictions while accommodating different modeling priorities.

# Combine results into a single data frame
all_models_performance <- data.frame(
  Model_Type = c(
    rep("OLS", 8),  # OLS Models
    rep("PLS", 5),  # Partial Least Squares Models
    rep("Non-Linear", 4),  # Non-linear Models
    rep("Tree-Based", 4)  # Tree-Based Models
  ),
  Model = c(
    # OLS Models
    "MLR (All Features Included)", 
    "Forward Selection", 
    "Backward Elimination", 
    "Stepwise Selection", 
    "Recursive Feature Elimination (RFE)", 
    "Interaction Terms", 
    "Polynomial Features", 
    "Feature Engineering",
    # PLS Models
    "PLS", 
    "PCR", 
    "Ridge Regression", 
    "Lasso Regression", 
    "Elastic Net",
    # Non-linear Models
    "KNN", 
    "MARS", 
    "SVR", 
    "ANN",
    # Tree-Based Models
    "Bagged Trees", 
    "Random Forest", 
    "GBM", 
    "XGBoost"
  ),
  Train_R2 = c(
    # OLS Models
    summary(full_model)$r.squared, 
    forward_metrics$Train_R2, 
    backward_metrics$Train_R2, 
    stepwise_metrics$Train_R2, 
    rfe_metrics$Train_R2, 
    interaction_metrics$Train_R2, 
    polynomial_metrics$Train_R2, 
    fe_metrics$Train_R2,
    # PLS Models
    pls_train_r2, 
    pcr_train_r2, 
    ridge_train_r2, 
    lasso_train_r2, 
    enet_train_r2,
    # Non-linear Models
    knn_model$results$Rsquared[which.min(knn_model$results$RMSE)],
    mars_model$results$Rsquared[which.min(mars_model$results$RMSE)],
    svr_model$results$Rsquared[which.min(svr_model$results$RMSE)],
    ann_model$results$Rsquared[which.min(ann_model$results$RMSE)],
    # Tree-Based Models (test-set R² reused here; separate training R² values were not computed)
    bagged_r2, 
    rf_r2, 
    boost_r2, 
    xgb_refined_r2
  ),
  Test_R2 = c(
    # OLS Models
    evaluate_model(full_model, train_data, test_data)$Test_R2, 
    forward_metrics$Test_R2, 
    backward_metrics$Test_R2, 
    stepwise_metrics$Test_R2, 
    rfe_metrics$Test_R2, 
    interaction_metrics$Test_R2, 
    polynomial_metrics$Test_R2, 
    fe_metrics$Test_R2,
    # PLS Models
    pls_test_r2, 
    pcr_test_r2, 
    ridge_test_r2, 
    lasso_test_r2, 
    enet_test_r2,
    # Non-linear Models
    knn_r2, 
    mars_r2, 
    svr_r2, 
    ann_r2,
    # Tree-Based Models
    bagged_r2, 
    rf_r2, 
    boost_r2, 
    xgb_refined_r2
  ),
  Train_RMSE = c(
    # OLS Models
    evaluate_model(full_model, train_data, test_data)$Train_RMSE, 
    forward_metrics$Train_RMSE, 
    backward_metrics$Train_RMSE, 
    stepwise_metrics$Train_RMSE, 
    rfe_metrics$Train_RMSE, 
    interaction_metrics$Train_RMSE, 
    polynomial_metrics$Train_RMSE, 
    fe_metrics$Train_RMSE,
    # PLS Models
    pls_train_rmse, 
    pcr_train_rmse, 
    ridge_train_rmse, 
    lasso_train_rmse, 
    enet_train_rmse,
    # Non-linear Models
    knn_model$results$RMSE[which.min(knn_model$results$RMSE)],
    mars_model$results$RMSE[which.min(mars_model$results$RMSE)],
    svr_model$results$RMSE[which.min(svr_model$results$RMSE)],
    ann_model$results$RMSE[which.min(ann_model$results$RMSE)],
    # Tree-Based Models (test-set RMSE reused here; separate training RMSE values were not computed)
    bagged_rmse, 
    rf_rmse, 
    boost_rmse, 
    xgb_refined_rmse
  ),
  Test_RMSE = c(
    # OLS Models
    evaluate_model(full_model, train_data, test_data)$Test_RMSE, 
    forward_metrics$Test_RMSE, 
    backward_metrics$Test_RMSE, 
    stepwise_metrics$Test_RMSE, 
    rfe_metrics$Test_RMSE, 
    interaction_metrics$Test_RMSE, 
    polynomial_metrics$Test_RMSE, 
    fe_metrics$Test_RMSE,
    # PLS Models
    pls_test_rmse, 
    pcr_test_rmse, 
    ridge_test_rmse, 
    lasso_test_rmse, 
    enet_test_rmse,
    # Non-linear Models
    knn_rmse, 
    mars_rmse, 
    svr_rmse, 
    ann_rmse,
    # Tree-Based Models
    bagged_rmse, 
    rf_rmse, 
    boost_rmse, 
    xgb_refined_rmse
  ),
  AIC = c(
    rep(NA, 8),  # OLS Models do not have AIC values
    rep(NA, 5),  # PLS Models do not have AIC values
    knn_aic_bic["AIC"], mars_aic_bic["AIC"], svr_aic_bic["AIC"], ann_aic_bic["AIC"],
    bagged_aic_bic["AIC"], rf_aic_bic["AIC"], boost_aic_bic["AIC"], xgb_aic_bic["AIC"]
  ),
  BIC = c(
    rep(NA, 8),  # OLS Models do not have BIC values
    rep(NA, 5),  # PLS Models do not have BIC values
    knn_aic_bic["BIC"], mars_aic_bic["BIC"], svr_aic_bic["BIC"], ann_aic_bic["BIC"],
    bagged_aic_bic["BIC"], rf_aic_bic["BIC"], boost_aic_bic["BIC"], xgb_aic_bic["BIC"]
  )
)

# Sort the data frame by Test_RMSE
all_models_performance <- all_models_performance %>% arrange(Test_RMSE)

# Display the consolidated DataFrame
all_models_performance %>%
  kable(
    caption = "Consolidated Model Performance Metrics"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) %>%
  column_spec(1, width = "10cm") %>%
  column_spec(2:7, width = "6cm")
Consolidated Model Performance Metrics
Model_Type Model Train_R2 Test_R2 Train_RMSE Test_RMSE AIC BIC
Tree-Based Random Forest 0.7011440 0.7011440 0.1000187 0.1000187 93.5700 2213.7079
Tree-Based GBM 0.6380685 0.6380685 0.1075398 0.1075398 67.9596 1976.0837
Tree-Based XGBoost 0.6223548 0.6223548 0.1099225 0.1099225 -409.5562 438.4990
Non-Linear SVR 0.5242925 0.5381500 0.1184443 0.1216659 2794.5859 10215.0686
Tree-Based Bagged Trees 0.5186901 0.5186901 0.1251824 0.1251824 -630.8485 -546.0429
Non-Linear ANN 0.4810389 0.4972056 0.1248814 0.1276438 -186.2019 810.2630
Non-Linear KNN 0.4680361 0.4892592 0.1258932 0.1286795 -647.9108 -647.9108
OLS Interaction Terms 0.6187726 0.4706976 0.1054870 0.1300389 NA NA
Non-Linear MARS 0.4347175 0.4500877 0.1288138 0.1327350 -584.0745 -516.2301
PLS Lasso Regression 0.4015648 0.3683422 0.1321666 0.1421016 NA NA
PLS Ridge Regression 0.4003721 0.3695003 0.1323733 0.1421150 NA NA
OLS Feature Engineering 0.4023797 0.3678177 0.1320747 0.1421158 NA NA
OLS Recursive Feature Elimination (RFE) 0.4016072 0.3677272 0.1321600 0.1421260 NA NA
OLS Backward Elimination 0.4005068 0.3659733 0.1322815 0.1423230 NA NA
OLS Forward Selection 0.4005068 0.3659733 0.1322815 0.1423230 NA NA
OLS Stepwise Selection 0.4005068 0.3659733 0.1322815 0.1423230 NA NA
PLS PCR 0.3889516 0.3659666 0.1335502 0.1423654 NA NA
OLS Polynomial Features 0.4054491 0.3635935 0.1317351 0.1425899 NA NA
PLS PLS 0.3864426 0.3606606 0.1338241 0.1429587 NA NA
OLS MLR (All Features Included) 0.4016072 -82.0321851 1.6236157 1.6287129 NA NA
PLS Elastic Net 0.0442056 0.0235254 194.6793992 194.0983447 NA NA

Model Predictions and Evaluation using Multiple Regression and Machine Learning Models

  1. Prediction: Using the trained models (Random Forest [Best Model], with GBM and XGBoost [Alternate Models]) to predict the target variable (PH) on a new evaluation dataset that lacks the ground truth for comparison.
  2. Preprocessing: Applying appropriate data preprocessing steps (e.g., scaling, encoding, and interaction term generation) specific to each model type.
  3. Evaluation: Visualizing the distribution of predicted values and comparing predictions across models using plots and summary statistics.
  4. Model Comparison: Discussing which model performs best and why one might choose an alternate model based on interpretability, ease of deployment, or other practical factors.

Below are the code blocks that generate predictions on the preprocessed evaluation datasets (tree_based_evaluation_data and gradient_models_evaluation_data_final) using the Random Forest model (best model) and the alternative tree-based models (GBM and XGBoost). The necessary data preprocessing for each model is carried over from the training pipeline, and the predicted PH distributions are plotted for comparison; a sketch of the shared prediction pattern follows this paragraph.
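
Each of the prediction blocks below repeats the same pattern: predict, attach the prediction column next to PH, write a CSV, and plot the distribution. Purely as a sketch, and assuming a fitted model plus a prepared evaluation data frame, that pattern could be wrapped in a small hypothetical helper such as add_predictions() below; the three blocks are kept explicit in this report for readability.

# Hypothetical helper illustrating the shared prediction workflow (not used in the blocks below)
add_predictions <- function(model, eval_data, pred_col, out_csv) {
  eval_data[[pred_col]] <- predict(model, newdata = eval_data)
  # Place the prediction column directly after PH when PH is present
  if ("PH" %in% names(eval_data)) {
    eval_data <- eval_data[, c("PH", pred_col, setdiff(names(eval_data), c("PH", pred_col)))]
  }
  write.csv(eval_data, out_csv, row.names = FALSE)
  eval_data
}
# Example (assumes rf_model and tree_based_evaluation_data exist):
# tree_based_evaluation_data <- add_predictions(rf_model, tree_based_evaluation_data,
#                                               "PH_predicted_rf", "rf_evaluation_results_with_predictions.csv")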

Load Libraries

# Load required libraries
library(randomForest)
library(caret)
library(e1071)
library(xgboost)
library(dplyr)
library(ggplot2)
library(Metrics)

Load Evaluation Dataset

# Load the evaluation dataset
evaluation_data <- evaluation_set_transformed

# Imputed dataset for evaluation
evaluation_data_imputed <- evaluation_set_imputed

1. Random Forest Predictions

# Random Forest Predictions
set.seed(123)
rf_predictions <- predict(rf_model, newdata = tree_based_evaluation_data)

# Add predictions to evaluation dataset
tree_based_evaluation_data$PH_predicted_rf <- rf_predictions

# Reorder columns to place PH_predicted_rf next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
  col_order <- c("PH", "PH_predicted_rf", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_rf")))
  tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}

# Save results to a CSV file
write.csv(tree_based_evaluation_data, "rf_evaluation_results_with_predictions.csv", row.names = FALSE)

# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_rf)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
  labs(
    title = "Predicted PH Distribution (Random Forest)",
    x = "Predicted PH",
    y = "Frequency"
  ) +
  theme_minimal()

# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame':    267 obs. of  35 variables:
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ PH_predicted_rf  : num  8.63 8.63 8.61 8.62 8.58 ...
##  $ Brand.Code       : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
##  $ Carb.Volume      : num  5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill.Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC.Volume        : num  0.278 0.232 0.313 0.19 0.163 ...
##  $ Carb.Pressure    : num  0.471 0.471 0.471 0.471 0.471 ...
##  $ Carb.Temp        : num  135 135 140 139 142 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC.Fill         : num  0.2144 0.1509 0.0832 0.1414 0.1834 ...
##  $ PSC.CO2          : num  0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
##  $ Mnf.Flow         : num  -115 -115 -115 -115 -115 ...
##  $ Carb.Pressure1   : num  9.74 9.81 9.85 9.99 9.69 ...
##  $ Fill.Pressure    : num  9.73 9.76 9.71 9.05 10.31 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  3.68 3.77 3.69 3.88 3.66 ...
##  $ Filler.Level     : num  129 120 119 120 116 ...
##  $ Filler.Speed     : int  3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
##  $ Usage.cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb.Flow        : num  4624 4567.4 4801 32.9 5065.5 ...
##  $ Density          : num  0.671 1.001 0.683 0.584 0.671 ...
##  $ MFR              : num  728 736 735 313 752 ...
##  $ Balling          : num  0.533 0.658 0.54 0.476 0.533 ...
##  $ Pressure.Vacuum  : num  -48.2 -70.6 -62.5 -55 -55 ...
##  $ Oxygen.Filler    : num  0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl.Setpoint    : int  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  0.452 0.452 0.452 0.452 0.452 ...
##  $ Air.Pressurer    : num  143 147 147 146 146 ...
##  $ Alch.Rel         : num  6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb.Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling.Lvl      : num  0.942 1.477 0.933 0.942 0.933 ...
##  $ Brand_Code       : num  4 1 2 2 2 2 1 2 1 4 ...

2. Boosting (GBM) Predictions

# Load necessary libraries
library(ggplot2)
library(caret)
library(gbm)

# Set seed for reproducibility
set.seed(123)

# Generate predictions on the evaluation dataset
gbm_predictions <- predict(boost_model, newdata = tree_based_evaluation_data)

# Add predictions to evaluation dataset
tree_based_evaluation_data$PH_predicted_gbm <- gbm_predictions

# Reorder columns to place PH_predicted_gbm next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
  col_order <- c("PH", "PH_predicted_gbm", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_gbm")))
  tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}

# Save results to a CSV file
write.csv(tree_based_evaluation_data, "gbm_evaluation_results_with_predictions.csv", row.names = FALSE)

# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_gbm)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
  labs(
    title = "Predicted PH Distribution (GBM)",
    x = "Predicted PH",
    y = "Frequency"
  ) +
  theme_minimal()

# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame':    267 obs. of  36 variables:
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ PH_predicted_gbm : num  8.72 8.73 8.75 8.79 8.68 ...
##  $ PH_predicted_rf  : num  8.63 8.63 8.61 8.62 8.58 ...
##  $ Brand.Code       : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
##  $ Carb.Volume      : num  5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill.Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC.Volume        : num  0.278 0.232 0.313 0.19 0.163 ...
##  $ Carb.Pressure    : num  0.471 0.471 0.471 0.471 0.471 ...
##  $ Carb.Temp        : num  135 135 140 139 142 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC.Fill         : num  0.2144 0.1509 0.0832 0.1414 0.1834 ...
##  $ PSC.CO2          : num  0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
##  $ Mnf.Flow         : num  -115 -115 -115 -115 -115 ...
##  $ Carb.Pressure1   : num  9.74 9.81 9.85 9.99 9.69 ...
##  $ Fill.Pressure    : num  9.73 9.76 9.71 9.05 10.31 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  3.68 3.77 3.69 3.88 3.66 ...
##  $ Filler.Level     : num  129 120 119 120 116 ...
##  $ Filler.Speed     : int  3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
##  $ Usage.cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb.Flow        : num  4624 4567.4 4801 32.9 5065.5 ...
##  $ Density          : num  0.671 1.001 0.683 0.584 0.671 ...
##  $ MFR              : num  728 736 735 313 752 ...
##  $ Balling          : num  0.533 0.658 0.54 0.476 0.533 ...
##  $ Pressure.Vacuum  : num  -48.2 -70.6 -62.5 -55 -55 ...
##  $ Oxygen.Filler    : num  0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl.Setpoint    : int  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  0.452 0.452 0.452 0.452 0.452 ...
##  $ Air.Pressurer    : num  143 147 147 146 146 ...
##  $ Alch.Rel         : num  6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb.Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling.Lvl      : num  0.942 1.477 0.933 0.942 0.933 ...
##  $ Brand_Code       : num  4 1 2 2 2 2 1 2 1 4 ...

3. XGBoost Predictions

# Predictions for XGBoost
set.seed(123)

# Generate predictions on the evaluation dataset
xgb_predictions <- predict(xgb_refined_model, newdata = gradient_models_evaluation_data_final)

# Add predictions to the tree-based evaluation dataset
# (assumes gradient_models_evaluation_data_final and tree_based_evaluation_data share the same row order)
tree_based_evaluation_data$PH_predicted_xgb <- xgb_predictions

# Reorder columns to place PH_predicted_xgb next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
  col_order <- c("PH", "PH_predicted_xgb", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_xgb")))
  tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}

# Save results to a CSV file
write.csv(tree_based_evaluation_data, "xgb_evaluation_results_with_predictions.csv", row.names = FALSE)

# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_xgb)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
  labs(
    title = "Predicted PH Distribution (XGBoost)",
    x = "Predicted PH",
    y = "Frequency"
  ) +
  theme_minimal()

# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame':    267 obs. of  37 variables:
##  $ PH               : logi  NA NA NA NA NA NA ...
##  $ PH_predicted_xgb : num  8.56 8.59 8.57 8.59 8.55 ...
##  $ PH_predicted_gbm : num  8.72 8.73 8.75 8.79 8.68 ...
##  $ PH_predicted_rf  : num  8.63 8.63 8.61 8.62 8.58 ...
##  $ Brand.Code       : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
##  $ Carb.Volume      : num  5.48 5.39 5.29 5.27 5.41 ...
##  $ Fill.Ounces      : num  24 24 23.9 23.9 24.2 ...
##  $ PC.Volume        : num  0.278 0.232 0.313 0.19 0.163 ...
##  $ Carb.Pressure    : num  0.471 0.471 0.471 0.471 0.471 ...
##  $ Carb.Temp        : num  135 135 140 139 142 ...
##  $ PSC              : num  0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
##  $ PSC.Fill         : num  0.2144 0.1509 0.0832 0.1414 0.1834 ...
##  $ PSC.CO2          : num  0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
##  $ Mnf.Flow         : num  -115 -115 -115 -115 -115 ...
##  $ Carb.Pressure1   : num  9.74 9.81 9.85 9.99 9.69 ...
##  $ Fill.Pressure    : num  9.73 9.76 9.71 9.05 10.31 ...
##  $ Hyd.Pressure1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyd.Pressure4    : num  3.68 3.77 3.69 3.88 3.66 ...
##  $ Filler.Level     : num  129 120 119 120 116 ...
##  $ Filler.Speed     : int  3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
##  $ Temperature      : num  66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
##  $ Usage.cont       : num  21.7 17.6 24.2 18.1 21.3 ...
##  $ Carb.Flow        : num  4624 4567.4 4801 32.9 5065.5 ...
##  $ Density          : num  0.671 1.001 0.683 0.584 0.671 ...
##  $ MFR              : num  728 736 735 313 752 ...
##  $ Balling          : num  0.533 0.658 0.54 0.476 0.533 ...
##  $ Pressure.Vacuum  : num  -48.2 -70.6 -62.5 -55 -55 ...
##  $ Oxygen.Filler    : num  0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
##  $ Bowl.Setpoint    : int  130 120 120 120 120 120 120 120 120 120 ...
##  $ Pressure.Setpoint: num  0.452 0.452 0.452 0.452 0.452 ...
##  $ Air.Pressurer    : num  143 147 147 146 146 ...
##  $ Alch.Rel         : num  6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
##  $ Carb.Rel         : num  5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
##  $ Balling.Lvl      : num  0.942 1.477 0.933 0.942 0.933 ...
##  $ Brand_Code       : num  4 1 2 2 2 2 1 2 1 4 ...

Comparison of Predicted PH Values Across Models

The pairwise scatterplots compare the predicted PH values across three models: Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost (XGB). The following observations can be made:

  • Strong Agreement: The diagonal patterns in the plots indicate a high level of agreement between models. Predictions from the models are closely aligned, with minimal dispersion.
  • Consistency: The predictions show consistent relationships across pairs of models, suggesting that all three models capture similar trends in the data.
  • Differences: Slight deviations in alignment (e.g., slight spread around the diagonal) suggest that while the models agree overall, there are subtle differences in their predictions, which may reflect differences in how each model handles data and relationships.

This analysis highlights the reliability of the models, while also pointing to minor variations that could be further explored depending on application requirements.

# Combine predictions into a single data frame
predictions_df <- tree_based_evaluation_data %>%
  select(PH_predicted_rf, PH_predicted_gbm, PH_predicted_xgb)

# Display summary of predicted PH values
summary(predictions_df)
##  PH_predicted_rf PH_predicted_gbm PH_predicted_xgb
##  Min.   :8.338   Min.   :8.320    Min.   :8.192   
##  1st Qu.:8.484   1st Qu.:8.565    1st Qu.:8.418   
##  Median :8.537   Median :8.636    Median :8.507   
##  Mean   :8.546   Mean   :8.632    Mean   :8.497   
##  3rd Qu.:8.609   3rd Qu.:8.702    3rd Qu.:8.588   
##  Max.   :8.744   Max.   :8.875    Max.   :8.701
# Save results to a CSV file
write.csv(predictions_df, "combined_prediction_result.csv", row.names = FALSE)


# Pairwise scatterplots to visualize consistency between models
pairs(predictions_df, 
      main = "Predicted PH Values: Model Comparisons", 
      pch = 21, 
      bg = c("blue", "green", "red", "purple"))