Both datasets have consistent structures with 33 variables.
# Load training and test datasets with explicit missing value handling
train_set <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-624/refs/heads/main/Project%202/StudentData.csv", na.strings = c("", "NA", "NULL"))
evaluation_set <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-624/refs/heads/main/Project%202/StudentEvaluation.csv", na.strings = c("", "NA", "NULL"))
# Verify missing values in Brand.Code
cat("Number of missing values in Brand.Code (Training Dataset):", sum(is.na(train_set$Brand.Code)), "\n")
## Number of missing values in Brand.Code (Training Dataset): 120
cat("Number of missing values in Brand.Code (Evaluation Dataset):", sum(is.na(evaluation_set$Brand.Code)), "\n")
## Number of missing values in Brand.Code (Evaluation Dataset): 8
# Check the structure of the dataset to confirm the changes
str(train_set)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : chr "B" "A" "B" "A" ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 24 24 24.1 24 24.3 ...
## $ PC.Volume : num 0.263 0.239 0.263 0.293 0.111 ...
## $ Carb.Pressure : num 68.2 68.4 70.8 63 67.2 66.6 64.2 67.6 64.2 72 ...
## $ Carb.Temp : num 141 140 145 133 137 ...
## $ PSC : num 0.104 0.124 0.09 NA 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.26 0.22 0.34 0.42 0.16 0.24 0.4 0.34 0.12 0.24 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 119 122 120 115 118 ...
## $ Fill.Pressure : num 46 46 46 46.4 45.8 45.6 51.8 46.8 46 45.2 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA NA NA 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 118 106 82 92 92 116 124 132 90 108 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 NA 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : int 2932 3144 2914 3062 3054 2948 30 684 2902 3038 ...
## $ Density : num 0.88 0.92 1.58 1.54 1.54 1.52 0.84 0.84 0.9 0.9 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 1.4 1.5 3.14 3.04 3.04 ...
## $ Pressure.Vacuum : num -4 -4 -3.8 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 1.48 1.56 3.28 3.04 3.04 3.02 1.44 1.44 1.44 1.38 ...
cat("\n\n")
# Check the structure of the dataset to confirm the changes
str(evaluation_set)
## 'data.frame': 267 obs. of 33 variables:
## $ Brand.Code : chr "D" "A" "B" "B" ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.27 0.227 0.303 0.186 0.16 ...
## $ Carb.Pressure : num 65.4 63.2 66.4 64.8 69.4 73.4 65.2 67.4 66.8 72.6 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.4 0.22 0.1 0.2 0.3 0.22 0.14 0.1 0.48 0.1 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 NA 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 ...
## $ Carb.Pressure1 : num 117 119 120 125 115 ...
## $ Fill.Pressure : num 46 46.2 45.8 40 51.4 46.4 46.2 40 43.8 40.8 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num NA 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : int 96 112 98 132 94 94 108 108 110 106 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 NA 4018 4010 4010 NA 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 NA 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : int 2950 2916 3056 28 3214 3064 3042 1972 2502 28 ...
## $ Density : num 0.88 1.5 0.9 0.74 0.88 0.84 1.48 1.6 1.52 1.48 ...
## $ MFR : num 728 736 735 NA 752 ...
## $ Balling : num 1.4 2.94 1.45 1.06 1.4 ...
## $ Pressure.Vacuum : num -3.8 -4.4 -4.2 -4 -4 -3.8 -4.2 -4.4 -4.4 -4.2 ...
## $ PH : logi NA NA NA NA NA NA ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 NA 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 45.2 46 46 46 50 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 1.48 3.04 1.46 1.48 1.46 1.44 3.02 3 3.1 3.12 ...
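The structural consistency claimed above can also be verified programmatically. The sketch below is illustrative only; it assumes the two data frames just loaded, and the one expected difference is PH, which is entirely NA in the evaluation file and is therefore read as logical.
# Hedged sketch: confirm both files share the same 33 columns
identical(names(train_set), names(evaluation_set))  # expect TRUE
c(ncol(train_set), ncol(evaluation_set))            # expect 33, 33
# Column classes should match except PH (all NA in the evaluation file, read as logical)
setdiff(paste(names(train_set), sapply(train_set, class)),
        paste(names(evaluation_set), sapply(evaluation_set, class)))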
Key observations from the numerical summary below:
- Hyd.Pressure1 spans a wide range (-0.80 to 58.00, SD = 12.43), likely requiring transformation.
- Several strongly skewed variables (e.g., Carb.Flow, Mnf.Flow, Hyd.Pressure*) are candidates for transformation to normalize their distributions.
- A few variables contain extreme values (e.g., Carb.Flow, Filler.Speed).
- MFR requires careful imputation due to its notable missing-value percentage (8.25%).
# Load necessary libraries
# install.packages('kableExtra', repos='http://cran.rstudio.com/')
library(kableExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Create vectors of numeric and categorical variable names for train_set
numeric_vars <- names(train_set)[sapply(train_set, is.numeric)]
categorical_vars <- names(train_set)[sapply(train_set, is.factor)]
# Correctly select numerical variables using the predefined numeric_vars
numerical_vars <- train_set %>%
dplyr::select(all_of(numeric_vars))
# Compute statistical summary including missing value counts and percentages
statistical_summary <- numerical_vars %>%
summarise(across(
everything(),
list(
Min = ~round(min(., na.rm = TRUE), 2),
Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
Mean = ~round(mean(., na.rm = TRUE), 2),
Median = ~round(median(., na.rm = TRUE), 2),
Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
Max = ~round(max(., na.rm = TRUE), 2),
SD = ~round(sd(., na.rm = TRUE), 2),
Missing = ~sum(is.na(.)), # Count of missing values
PercentMissing = ~round(mean(is.na(.)) * 100, 2) # Percentage of missing values
),
.names = "{.col}_{.fn}"
)) %>%
pivot_longer(
cols = everything(),
names_to = c("Variable", ".value"),
names_pattern = "^(.*)_(.*)$"
)
# Display the resulting summary table
statistical_summary %>%
kable(caption = "Summary of Numerical Variables (Including Missing Counts and Percentages)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Variable | Min | Q1 | Mean | Median | Q3 | Max | SD | Missing | PercentMissing |
---|---|---|---|---|---|---|---|---|---|
Carb.Volume | 5.04 | 5.29 | 5.37 | 5.35 | 5.45 | 5.70 | 0.11 | 10 | 0.39 |
Fill.Ounces | 23.63 | 23.92 | 23.97 | 23.97 | 24.03 | 24.32 | 0.09 | 38 | 1.48 |
PC.Volume | 0.08 | 0.24 | 0.28 | 0.27 | 0.31 | 0.48 | 0.06 | 39 | 1.52 |
Carb.Pressure | 57.00 | 65.60 | 68.19 | 68.20 | 70.60 | 79.40 | 3.54 | 27 | 1.05 |
Carb.Temp | 128.60 | 138.40 | 141.09 | 140.80 | 143.80 | 154.00 | 4.04 | 26 | 1.01 |
PSC | 0.00 | 0.05 | 0.08 | 0.08 | 0.11 | 0.27 | 0.05 | 33 | 1.28 |
PSC.Fill | 0.00 | 0.10 | 0.20 | 0.18 | 0.26 | 0.62 | 0.12 | 23 | 0.89 |
PSC.CO2 | 0.00 | 0.02 | 0.06 | 0.04 | 0.08 | 0.24 | 0.04 | 39 | 1.52 |
Mnf.Flow | -100.20 | -100.00 | 24.57 | 65.20 | 140.80 | 229.40 | 119.48 | 2 | 0.08 |
Carb.Pressure1 | 105.60 | 119.00 | 122.59 | 123.20 | 125.40 | 140.20 | 4.74 | 32 | 1.24 |
Fill.Pressure | 34.60 | 46.00 | 47.92 | 46.40 | 50.00 | 60.40 | 3.18 | 22 | 0.86 |
Hyd.Pressure1 | -0.80 | 0.00 | 12.44 | 11.40 | 20.20 | 58.00 | 12.43 | 11 | 0.43 |
Hyd.Pressure2 | 0.00 | 0.00 | 20.96 | 28.60 | 34.60 | 59.40 | 16.39 | 15 | 0.58 |
Hyd.Pressure3 | -1.20 | 0.00 | 20.46 | 27.60 | 33.40 | 50.00 | 15.98 | 15 | 0.58 |
Hyd.Pressure4 | 52.00 | 86.00 | 96.29 | 96.00 | 102.00 | 142.00 | 13.12 | 30 | 1.17 |
Filler.Level | 55.80 | 98.30 | 109.25 | 118.40 | 120.00 | 161.20 | 15.70 | 20 | 0.78 |
Filler.Speed | 998.00 | 3888.00 | 3687.20 | 3982.00 | 3998.00 | 4030.00 | 770.82 | 57 | 2.22 |
Temperature | 63.60 | 65.20 | 65.97 | 65.60 | 66.40 | 76.20 | 1.38 | 14 | 0.54 |
Usage.cont | 12.08 | 18.36 | 20.99 | 21.79 | 23.75 | 25.90 | 2.98 | 5 | 0.19 |
Carb.Flow | 26.00 | 1144.00 | 2468.35 | 3028.00 | 3186.00 | 5104.00 | 1073.70 | 2 | 0.08 |
Density | 0.24 | 0.90 | 1.17 | 0.98 | 1.62 | 1.92 | 0.38 | 1 | 0.04 |
MFR | 31.40 | 706.30 | 704.05 | 724.00 | 731.00 | 868.60 | 73.90 | 212 | 8.25 |
Balling | -0.17 | 1.50 | 2.20 | 1.65 | 3.29 | 4.01 | 0.93 | 1 | 0.04 |
Pressure.Vacuum | -6.60 | -5.60 | -5.22 | -5.40 | -5.00 | -3.60 | 0.57 | 0 | 0.00 |
PH | 7.88 | 8.44 | 8.55 | 8.54 | 8.68 | 9.36 | 0.17 | 4 | 0.16 |
Oxygen.Filler | 0.00 | 0.02 | 0.05 | 0.03 | 0.06 | 0.40 | 0.05 | 12 | 0.47 |
Bowl.Setpoint | 70.00 | 100.00 | 109.33 | 120.00 | 120.00 | 140.00 | 15.30 | 2 | 0.08 |
Pressure.Setpoint | 44.00 | 46.00 | 47.62 | 46.00 | 50.00 | 52.00 | 2.04 | 12 | 0.47 |
Air.Pressurer | 140.80 | 142.20 | 142.83 | 142.60 | 143.00 | 148.20 | 1.21 | 0 | 0.00 |
Alch.Rel | 5.28 | 6.54 | 6.90 | 6.56 | 7.24 | 8.62 | 0.51 | 9 | 0.35 |
Carb.Rel | 4.96 | 5.34 | 5.44 | 5.40 | 5.54 | 6.06 | 0.13 | 10 | 0.39 |
Balling.Lvl | 0.00 | 1.38 | 2.05 | 1.48 | 3.14 | 3.66 | 0.87 | 1 | 0.04 |
Suggested handling by level of missingness (a hedged KNN sketch follows the missing-values chart below):
- High missingness (e.g., MFR, ~8%): use predictive imputation methods such as MICE (Multivariate Imputation by Chained Equations) or KNN to estimate values based on patterns in other variables.
- Low to moderate missingness (<5%): variables with only a small share of missing values (e.g., PC.Volume, Fill.Ounces) can be handled with simpler imputation.
library(ggplot2)
library(dplyr)
# Prepare data for missing values
missing_data <- train_set %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing") %>%
mutate(PercentMissing = (Missing / nrow(train_set)) * 100) %>%
filter(Missing > 0) # Include only variables with missing values
# Create the flipped bar chart
ggplot(missing_data, aes(x = reorder(Variable, PercentMissing), y = PercentMissing)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Missing Values by Variable",
x = "Variable",
y = "Percentage of Missing Values (%)"
) +
theme_minimal()
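As an alternative for the variables with low-to-moderate missingness discussed above, k-nearest-neighbours imputation is sketched below. This is illustrative only (this report imputes with mice and PMM later); it assumes the numeric columns of train_set and an arbitrary k = 5, and note that knnImpute returns the data on a centred and scaled scale.
# Hedged sketch: KNN imputation of the numeric predictors with caret (illustrative only)
library(caret)
library(dplyr)
numeric_only <- train_set %>% dplyr::select(where(is.numeric))
knn_pp <- preProcess(numeric_only, method = "knnImpute", k = 5)  # centres/scales internally
numeric_knn_imputed <- predict(knn_pp, numeric_only)
sum(is.na(numeric_knn_imputed))  # expected to be 0 after imputation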
The Brand.Code variable has 5 unique levels (including NA), 120 missing values, and a mode of B (48.19%). It shows moderate entropy (1.93) and a high imbalance ratio (10.32), indicating a skewed distribution.
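For context, the Shannon entropy reported in the table below is computed over the level proportions \(p_i\):
\[
H = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad 0 \le H \le \log_2 k .
\]
With \(k = 5\) levels the maximum is \(\log_2 5 \approx 2.32\), so the observed \(H = 1.93\) reflects a distribution that is unbalanced but far from concentrated in a single level; the imbalance ratio is simply the largest level count divided by the smallest.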
library(dplyr)
library(knitr)
# Select categorical columns including Brand.Code
categorical_columns <- train_set %>%
dplyr::select(all_of(c(categorical_vars, "Brand.Code")))
# Function to calculate Shannon entropy
calculate_entropy <- function(counts) {
proportions <- counts / sum(counts)
entropy <- -sum(proportions * log2(proportions), na.rm = TRUE)
return(entropy)
}
# Function to calculate imbalance ratio
calculate_imbalance_ratio <- function(counts) {
max_count <- max(counts, na.rm = TRUE)
min_count <- min(counts, na.rm = TRUE)
if (min_count == 0) {
imbalance_ratio <- Inf # Avoid division by zero
} else {
imbalance_ratio <- max_count / min_count
}
return(imbalance_ratio)
}
# Ensure all levels for each variable are included
complete_levels <- function(var, data) {
unique_levels <- unique(data[[var]])
factor(unique_levels, levels = unique_levels)
}
# Compute the summary for each categorical variable
categorical_summary <- lapply(names(categorical_columns), function(var) {
# Ensure all levels are accounted for, even with 0 counts
summary_df <- train_set %>%
count(!!sym(var), .drop = FALSE) %>%
complete(!!sym(var) := unique(train_set[[var]]), fill = list(n = 0)) %>%
mutate(Percentage = round(n / sum(n) * 100, 2)) %>%
rename(Level = !!sym(var), Count = n) %>%
mutate(Variable = var) # Add the variable name for identification
# Compute the mode for the variable
mode_row <- summary_df %>%
filter(Count == max(Count, na.rm = TRUE)) %>%
slice_head(n = 1) %>% # Use `slice_head()` to handle ties safely
pull(Level)
# Compute percentage for the mode
mode_percentage <- summary_df %>%
filter(Level == mode_row) %>%
pull(Percentage) %>%
first() # Ensure it works even if there are multiple matches
# Count missing values for the variable
missing_count <- sum(is.na(train_set[[var]]))
# Count unique levels
unique_levels_count <- n_distinct(train_set[[var]])
# Compute entropy
entropy <- calculate_entropy(summary_df$Count)
# Compute imbalance ratio
imbalance_ratio <- calculate_imbalance_ratio(summary_df$Count)
# Combine into a single row summary for the variable
final_row <- data.frame(
Variable = var,
Mode = as.character(mode_row), # Ensure Mode is always a character
Mode_Percentage = mode_percentage,
Missing_Count = missing_count,
Unique_Levels = unique_levels_count,
Entropy = round(entropy, 2),
Imbalance_Ratio = round(imbalance_ratio, 2),
stringsAsFactors = FALSE # Avoid factors unless explicitly needed
)
return(final_row)
})
# Combine summaries into a single data frame
categorical_summary_df <- bind_rows(categorical_summary)
# Print the resulting summary
categorical_summary_df %>%
kable(caption = "Summary of Categorical Variables (Including Missing Counts and Percentages)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Variable | Mode | Mode_Percentage | Missing_Count | Unique_Levels | Entropy | Imbalance_Ratio |
---|---|---|---|---|---|---|
Brand.Code | B | 48.19 | 120 | 5 | 1.93 | 10.32 |
There is a statistically significant relationship between Brand_Code and PH based on the analyses below. We will therefore retain Brand.Code as a predictor variable:
The mean, median, and standard deviation of PH vary across the different levels of Brand_Code, indicating differences in central tendency and variability.
The ANOVA results show a significant F-statistic with a p-value less than 0.05 (\(p < 2.2 \times 10^{-16}\)), which suggests that there are statistically significant differences in the mean PH values across the levels of Brand_Code.
The boxplots illustrate visible differences in the distributions of PH for each Brand_Code. These differences reinforce the findings from the summary statistics and ANOVA.
When PH is categorized into levels such as “Low,” “Medium,” and “High,” the Chi-Square test also indicates a significant association (\(p < 2.2 \times 10^{-16}\)). However, the warning about the Chi-Square approximation suggests caution in interpretation, potentially due to sparse data in some categories.
The statistical evidence supports a relationship between Brand_Code and PH. These findings could inform further modeling, such as including Brand_Code as a categorical predictor in statistical or machine learning models.
# Summary of PH by Brand_Code
summary_stats <- aggregate(PH ~ Brand.Code,
data = train_set,
FUN = function(x) c(mean = mean(x),
median = median(x),
sd = sd(x)))
# Convert the result into a more readable data frame
summary_df <- do.call(data.frame, summary_stats)
# Rename the columns for better understanding
colnames(summary_df) <- c("Brand_Code", "PH_Mean", "PH_Median", "PH_SD")
# Print the summary
summary_df %>%
kable(caption = "Summary of PH by Brand_Code") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Brand_Code | PH_Mean | PH_Median | PH_SD |
---|---|---|---|
A | 8.497406 | 8.52 | 0.1630744 |
B | 8.566785 | 8.56 | 0.1689752 |
C | 8.413684 | 8.42 | 0.1771266 |
D | 8.602504 | 8.62 | 0.1353614 |
# Perform ANOVA
anova_result <- aov(PH ~ Brand.Code, data = train_set)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brand.Code 3 8.50 2.8322 108.5 <2e-16 ***
## Residuals 2443 63.76 0.0261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 124 observations deleted due to missingness
# Visualize using boxplots
library(ggplot2)
ggplot(train_set, aes(x = Brand.Code, y = PH)) +
geom_boxplot() +
labs(title = "Boxplot of PH by Brand Code", x = "Brand Code", y = "PH") +
theme_minimal()
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Convert PH to categorical (if needed)
train_set$PH_cat <- cut(train_set$PH, breaks = 3, labels = c("Low", "Medium", "High"))
# Function to calculate mode
get_mode <- function(x) {
ux <- na.omit(unique(x))
ux[which.max(tabulate(match(x, ux)))]
}
# Impute missing values in PH_cat with the mode
mode_PH_cat <- get_mode(train_set$PH_cat)
train_set$PH_cat[is.na(train_set$PH_cat)] <- mode_PH_cat
# Perform Chi-Square Test between Brand.Code and PH_cat
chisq_test <- chisq.test(table(train_set$Brand.Code, train_set$PH_cat))
## Warning in chisq.test(table(train_set$Brand.Code, train_set$PH_cat)):
## Chi-squared approximation may be incorrect
print(chisq_test)
##
## Pearson's Chi-squared test
##
## data: table(train_set$Brand.Code, train_set$PH_cat)
## X-squared = 219.69, df = 6, p-value < 2.2e-16
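Because the chi-squared approximation was flagged as potentially unreliable (sparse cells), one hedged robustness check is to recompute the p-value by Monte Carlo simulation rather than relying on the asymptotic distribution. The number of replicates below is an arbitrary choice.
# Hedged robustness check: Monte Carlo p-value for the same contingency table
set.seed(123)
chisq_mc <- chisq.test(table(train_set$Brand.Code, train_set$PH_cat),
                       simulate.p.value = TRUE, B = 10000)
print(chisq_mc)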
The imputation methods applied in the script are outlined below, incorporating the specialized treatment of the Brand_Code variable and handling the scenario where the target variable (PH) is unavailable in the evaluation set.
Imputation Methods Summary:
- Numeric variables: missing values were imputed with predictive mean matching (PMM) from the mice package.
- Categorical variable (Brand_Code): in the training set, missing values of Brand_Code were imputed using a multinomial logistic regression model with PH as the predictor. This method leverages the observed relationship between Brand_Code and PH to provide accurate and contextually appropriate imputations.
- In the evaluation set (where PH is unavailable), missing values in Brand_Code were imputed using mode-based imputation: the most frequent category from the training data was assigned to ensure consistency while addressing the absence of the predictor.
- PH was explicitly excluded from imputation in the evaluation_set, as its values are missing by design and represent the target variable to be predicted.
- Missing values in PH were excluded from the final missing values report, as they are not subject to imputation.
# Check and install necessary packages
necessary_packages <- c("mice", "dplyr", "nnet")
for (pkg in necessary_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
}
# Load the required libraries
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(dplyr)
library(nnet)
# Function to impute numeric variables using MICE
impute_with_mice <- function(data, exclude_vars = NULL) {
numeric_data <- data %>% select_if(is.numeric)
set.seed(123) # For reproducibility
imputed_data <- mice(numeric_data, m = 1, method = "pmm", maxit = 5, printFlag = FALSE)
data[, names(numeric_data)] <- complete(imputed_data)
return(data)
}
# Function to impute Brand.Code using Multinomial Logistic Regression
impute_brand_code <- function(data, target_col, predictor_col) {
# Ensure Brand.Code is a factor
data[[target_col]] <- factor(data[[target_col]])
# Filter data with non-missing values
train_data_non_missing <- data[!is.na(data[[target_col]]), ]
# Check if sufficient classes are present
if (length(unique(train_data_non_missing[[target_col]])) > 1) {
# Train multinomial logistic regression
model <- multinom(as.formula(paste(target_col, "~", predictor_col)), data = train_data_non_missing)
# Predict missing values
missing_indices <- is.na(data[[target_col]])
data[[target_col]][missing_indices] <- predict(model, newdata = data[missing_indices, ])
# Ensure the factor levels are consistent
data[[target_col]] <- factor(data[[target_col]], levels = levels(train_data_non_missing[[target_col]]))
} else {
# Use the mode class if only one class is present
warning("Only one class detected for imputation. Using mode imputation.")
mode_class <- names(sort(table(train_data_non_missing[[target_col]]), decreasing = TRUE))[1]
data[[target_col]][is.na(data[[target_col]])] <- mode_class
}
return(data)
}
# Function to impute Brand.Code for datasets without PH values
impute_brand_code_no_ph <- function(data, target_col, train_data) {
# Ensure Brand.Code is a factor
train_data[[target_col]] <- factor(train_data[[target_col]])
# Use the mode of Brand.Code from the training data
mode_class <- names(sort(table(train_data[[target_col]]), decreasing = TRUE))[1]
data[[target_col]][is.na(data[[target_col]])] <- mode_class
data[[target_col]] <- factor(data[[target_col]], levels = levels(train_data[[target_col]]))
return(data)
}
# Create copies of train_set and evaluation_set
train_set_imputed <- train_set
evaluation_set_imputed <- evaluation_set
# Step 1: Impute numeric variables
train_set_imputed <- impute_with_mice(train_set_imputed)
evaluation_set_imputed <- impute_with_mice(evaluation_set_imputed)
# Step 2: Impute Brand.Code
train_set_imputed <- impute_brand_code(train_set_imputed, target_col = "Brand.Code", predictor_col = "PH")
## # weights: 12 (6 variable)
## initial value 3397.807479
## iter 10 value 2806.215098
## final value 2804.972880
## converged
# Use mode-based imputation for the evaluation set as it lacks PH values
evaluation_set_imputed <- impute_brand_code_no_ph(evaluation_set_imputed, target_col = "Brand.Code", train_data = train_set_imputed)
# Step 3: Check for any remaining missing values
check_missing <- function(data) {
missing_summary <- colSums(is.na(data))
missing_summary <- missing_summary[missing_summary > 0]
return(missing_summary)
}
# Check missing values
cat("Missing values in train_set_imputed:\n")
## Missing values in train_set_imputed:
print(check_missing(train_set_imputed))
## named numeric(0)
cat("\nMissing values in evaluation_set_imputed:\n")
##
## Missing values in evaluation_set_imputed:
print(check_missing(evaluation_set_imputed))
## PH
## 267
# Structure of imputed datasets
cat("\nStructure of Train Set Imputed:\n")
##
## Structure of Train Set Imputed:
# Drop PH_cat column from train_set_imputed
if ("PH_cat" %in% colnames(train_set_imputed)) {
train_set_imputed <- train_set_imputed %>% dplyr::select(-PH_cat)
cat("PH_cat column has been removed from train_set_imputed.\n")
} else {
cat("PH_cat column does not exist in train_set_imputed.\n")
}
## PH_cat column has been removed from train_set_imputed.
The imputed train dataset was analyzed to understand the distributions, skewness, and presence of outliers among its numeric variables. The goal was to identify patterns in the data, such as normality, skewed distributions, and extreme values, while also determining variables with near-zero variance that provide limited analytical value. Below is a summary of the findings:
# Install and load required libraries
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
library(caret)
## Loading required package: lattice
library(dplyr)
library(tidyr)
# Reshape the imputed train_set to long format for numeric columns
numeric_cols <- sapply(train_set_imputed, is.numeric)
train_set_long <- pivot_longer(train_set_imputed,
cols = all_of(names(train_set_imputed)[numeric_cols]),
names_to = "variable",
values_to = "value")
# Identify variables with near-zero variance using caret::nearZeroVar
nzv_indices <- nearZeroVar(train_set_imputed, saveMetrics = TRUE)
# Extract variables with near-zero variance
nzv_vars <- rownames(nzv_indices[nzv_indices$nzv == TRUE, ])
# Output the list of variables with near-zero variance
cat("Variables with near-zero variance (not plotted):\n")
## Variables with near-zero variance (not plotted):
print(nzv_vars)
## [1] "Hyd.Pressure1"
# Filter out variables with near-zero variance from the long-format data
train_set_long_filtered <- train_set_long %>%
filter(!variable %in% nzv_vars)
# Clip extreme values to the 1st and 99th percentiles for better visualization
train_set_long_filtered <- train_set_long_filtered %>%
group_by(variable) %>%
mutate(value = pmin(pmax(value, quantile(value, 0.01, na.rm = TRUE)),
quantile(value, 0.99, na.rm = TRUE))) %>%
ungroup()
# Prepare the list of numeric variables for separate plotting
numeric_variables <- unique(train_set_long_filtered$variable)
# Set up the plotting area for histograms and boxplots
par(mfrow = c(1, 2)) # 2 columns for histogram and boxplot side-by-side
# Loop through each numeric variable to plot
for (var in numeric_variables) {
# Filter data for the current variable
var_data <- train_set_long_filtered %>% filter(variable == var) %>% pull(value)
# Plot histogram
hist(var_data, main = paste("Histogram of", var),
xlab = var, breaks = 30, col = "lightblue", border = "black")
# Plot boxplot
boxplot(var_data, main = paste("Boxplot of", var),
horizontal = TRUE, col = "lightgreen")
}
# Reset plotting layout to default
par(mfrow = c(1, 1))
To prepare the datasets for statistical and machine learning models, the following steps were applied:
- Missing values, including those in Brand.Code, were identified and appropriately imputed to ensure data completeness.
- A Yeo-Johnson transformation was applied to the numeric predictors (excluding PH and categorical variables) to handle both positive and negative values. This transformation stabilizes variance and improves normality, enhancing model performance and robustness.
These preprocessing steps ensure the datasets are ready for accurate and reliable analysis in statistical and machine learning models.
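For reference, the Yeo-Johnson transformation used by caret's preProcess(method = "YeoJohnson") is defined piecewise, which is what lets it handle zero and negative values (unlike Box-Cox):
\[
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & y \ge 0,\ \lambda \ne 0, \\[4pt]
\log(y + 1), & y \ge 0,\ \lambda = 0, \\[4pt]
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda}, & y < 0,\ \lambda \ne 2, \\[4pt]
-\log(1 - y), & y < 0,\ \lambda = 2,
\end{cases}
\]
with \(\lambda\) estimated from the data separately for each column.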
# Load necessary libraries
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
library(caret) # For preProcess and Yeo-Johnson transformation
library(dplyr) # For data manipulation
# Specify columns to exclude (e.g., PH, Brand.Code)
exclude_cols <- c("Brand.Code", "PH")
# Create copies of the imputed datasets for transformation
train_set_transformed <- train_set_imputed
evaluation_set_transformed <- evaluation_set_imputed
# Identify numeric columns to process in the train and evaluation sets
numeric_cols_train <- setdiff(
names(train_set_imputed)[sapply(train_set_imputed, is.numeric)],
exclude_cols
)
numeric_cols_test <- setdiff(
names(evaluation_set_imputed)[sapply(evaluation_set_imputed, is.numeric)],
exclude_cols
)
# Process each numeric column in the train set
for (col in numeric_cols_train) {
tryCatch({
# Ensure the column is numeric
if (!is.numeric(train_set_imputed[[col]])) {
stop(paste("Column", col, "is not numeric in train_set_imputed"))
}
# Perform Yeo-Johnson transformation
yeo_johnson_result <- caret::preProcess(
as.data.frame(train_set_imputed[[col]]),
method = "YeoJohnson"
)
# Apply transformation to the train set
train_set_transformed[[col]] <- predict(
yeo_johnson_result,
as.data.frame(train_set_imputed[[col]])
)[, 1]
}, error = function(e) {
cat(paste("Error processing column", col, ":", e$message, "\n"))
})
}
# Process each numeric column in the evaluation set
for (col in numeric_cols_test) {
tryCatch({
# Ensure the column is numeric
if (!is.numeric(evaluation_set_imputed[[col]])) {
stop(paste("Column", col, "is not numeric in evaluation_set_imputed"))
}
# Perform Yeo-Johnson transformation
yeo_johnson_result <- caret::preProcess(
as.data.frame(evaluation_set_imputed[[col]]),
method = "YeoJohnson"
)
# Apply transformation to the evaluation set
evaluation_set_transformed[[col]] <- predict(
yeo_johnson_result,
as.data.frame(evaluation_set_imputed[[col]])
)[, 1]
}, error = function(e) {
cat(paste("Error processing column", col, ":", e$message, "\n"))
})
}
# Output the structure of transformed train and evaluation sets
cat("\n\nStructure of Transformed Train Set (Yeo-Johnson Applied):\n\n")
##
##
## Structure of Transformed Train Set (Yeo-Johnson Applied):
str(train_set_transformed)
## 'data.frame': 2571 obs. of 33 variables:
## $ Brand.Code : Factor w/ 4 levels "A","B","C","D": 2 1 2 1 1 1 1 2 2 2 ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 729 732 735 732 752 ...
## $ PC.Volume : num 0.208 0.193 0.208 0.227 0.1 ...
## $ Carb.Pressure : num 2.14 2.14 2.15 2.12 2.14 ...
## $ Carb.Temp : num 0.5 0.5 0.5 0.5 0.5 ...
## $ PSC : num 0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.169 0.152 0.199 0.222 0.121 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -114 -114 -114 -114 -114 ...
## $ Carb.Pressure1 : num 21.2 21.5 21.4 20.9 21.2 ...
## $ Fill.Pressure : num 1.41 1.41 1.41 1.41 1.41 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 2.64 2.61 2.54 2.57 2.57 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 1010 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : num 24689 27033 24492 26121 26032 ...
## $ Density : num 0.453 0.463 0.584 0.579 0.579 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 0.628 0.648 0.848 0.84 0.84 ...
## $ Pressure.Vacuum : num -44.9 -44.9 -39.6 -56.9 -56.9 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 0.678 0.695 0.925 0.903 0.903 ...
#cat("\n\nStructure of Transformed Evaluation Set (Yeo-Johnson Applied):\n\n")
#str(evaluation_set_transformed)
Hyd.Pressure2, MFR, and Carb.Volume show substantial decreases in skewness, transitioning towards more symmetric distributions favorable for modeling. Temperature and PH, which had low initial skewness, show little to no change, indicating that the transformation preserves already symmetric distributions.
# Load necessary library
if (!requireNamespace("e1071", quietly = TRUE)) {
install.packages("e1071")
}
library(e1071) # For calculating skewness
# Select numeric columns in the original and transformed datasets
numeric_cols <- names(train_set_imputed)[sapply(train_set_imputed, is.numeric)]
# Calculate skewness for train_set_imputed (Before transformation)
skewness_imputed <- sapply(train_set_imputed[numeric_cols], skewness, na.rm = TRUE)
# Calculate skewness for train_set_transformed (After Yeo-Johnson transformation)
skewness_transformed <- sapply(train_set_transformed[numeric_cols], skewness, na.rm = TRUE)
# Combine the skewness results into a data frame for comparison
skewness_comparison <- data.frame(
Variable = numeric_cols,
Skewness_Before = skewness_imputed,
Skewness_After = skewness_transformed
)
# Reshape the data for plotting
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
skewness_long <- melt(skewness_comparison, id.vars = "Variable",
variable.name = "Skewness_Type", value.name = "Skewness")
# Plot the skewness comparison
library(ggplot2)
ggplot(skewness_long, aes(x = Variable, y = Skewness, fill = Skewness_Type)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(
title = "Skewness Comparison: Before and After Yeo-Johnson Transformation",
x = "Variable",
y = "Skewness"
) +
theme_minimal() +
theme(axis.text.y = element_text(size = 8), legend.position = "top")
Below, the datasets are prepared for gradient-based models, tree-based models, and statistical models, reflecting the preprocessing steps and the appropriate handling of categorical variables such as Brand_Code.
For gradient-based models:
- Min-max scaling is applied to the numeric predictors (the target variable PH is excluded from scaling).
- One-hot encoding is applied to Brand_Code without dropping any dummy variables.
# Install required packages if not already installed
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("fastDummies", quietly = TRUE)) {
install.packages("fastDummies")
}
# Load libraries
library(caret)
library(fastDummies)
# Set seed for reproducibility
set.seed(123)
# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")
# Step 2: Scaling the Training Dataset
gradient_based_data <- filtered_data_final
# Save scaling parameters during training
min_max_scaler <- preProcess(gradient_based_data[, numeric_cols], method = "range")
gradient_based_data[, numeric_cols] <- predict(min_max_scaler, gradient_based_data[, numeric_cols])
# Save min-max scaling parameters for reverse scaling later
gbm_scaling_params <- data.frame(
feature = numeric_cols,
min = apply(filtered_data_final[, numeric_cols], 2, min),
max = apply(filtered_data_final[, numeric_cols], 2, max)
)
# Step 3: Scaling the Evaluation Dataset (Excluding "PH")
gradient_models_evaluation_data <- filtered_evaluation_final # Replace with actual evaluation dataset
# Convert PH to numeric
gradient_models_evaluation_data$PH <- as.numeric(as.character(gradient_models_evaluation_data$PH))
# Apply the same scaling to the numeric columns (excluding "PH")
numeric_evaluation_cols <- setdiff(names(gradient_models_evaluation_data)[sapply(gradient_models_evaluation_data, is.numeric)], "PH")
gradient_models_evaluation_data[, numeric_evaluation_cols] <- predict(min_max_scaler, gradient_models_evaluation_data[, numeric_evaluation_cols])
# Step 4: Reordering Columns to Match Training Data
gradient_models_evaluation_data <- gradient_models_evaluation_data[, colnames(gradient_based_data), drop = FALSE]
# Step 5: One-Hot Encoding for Categorical Variables
gradient_based_data <- dummy_cols(
gradient_based_data,
select_columns = "Brand.Code",
remove_first_dummy = FALSE, # Keep all dummy variables
remove_selected_columns = TRUE
)
gradient_models_evaluation_data <- dummy_cols(
gradient_models_evaluation_data,
select_columns = "Brand.Code",
remove_first_dummy = FALSE,
remove_selected_columns = TRUE
)
# Ensure PH remains numeric in evaluation dataset
gradient_models_evaluation_data$PH <- as.numeric(gradient_models_evaluation_data$PH)
# Step 6: Preparing Evaluation Data for Prediction
gradient_models_evaluation_data_final <- gradient_models_evaluation_data[, !colnames(gradient_models_evaluation_data) %in% "PH"]
# Step 7: Display the Structure of the Final Gradient-Based Datasets
cat("\n\nStructure of gradient_based_data:\n")
##
##
## Structure of gradient_based_data:
str(gradient_based_data)
## 'data.frame': 2571 obs. of 26 variables:
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Fill.Ounces : num 0.481 0.539 0.617 0.539 0.99 ...
## $ PC.Volume : num 0.54 0.477 0.54 0.613 0.107 ...
## $ Carb.Pressure : num 0.556 0.564 0.667 0.314 0.511 ...
## $ PSC : num 0.3806 0.4552 0.3284 0.2612 0.0896 ...
## $ PSC.Fill : num 0.646 0.579 0.757 0.846 0.461 ...
## $ PSC.CO2 : num 0.167 0.167 0.667 0.167 0.5 ...
## $ Mnf.Flow : num 0.000757 0.000757 0.000757 0.000757 0.000757 ...
## $ Carb.Pressure1 : num 0.398 0.479 0.439 0.291 0.386 ...
## $ Fill.Pressure : num 0.554 0.554 0.554 0.569 0.546 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 0.835 0.736 0.486 0.6 0.6 ...
## $ Temperature : num 0.19 0.317 0.27 0.159 0.159 ...
## $ Usage.cont : num 0.297 0.566 0.411 0.386 0.405 ...
## $ Carb.Flow : num 0.486 0.532 0.482 0.514 0.513 ...
## $ MFR : num 0.828 0.831 0.84 0.835 0.826 ...
## $ Pressure.Vacuum : num 0.92 0.92 0.962 0.827 0.827 ...
## $ Oxygen.Filler : num 0.0493 0.0594 0.0543 0.0694 0.0694 ...
## $ Bowl.Setpoint : num 0.714 0.714 0.714 0.714 0.714 ...
## $ Pressure.Setpoint: num 0.3 0.35 0.325 0.25 0.25 ...
## $ Air.Pressurer : num 0.243 0.297 0.162 0.73 0.73 ...
## $ Carb.Rel : num 0.327 0.309 0.8 0.418 0.436 ...
## $ Brand.Code_A : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Brand.Code_B : int 1 0 1 0 0 0 0 1 1 1 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of gradient_models_evaluation_data:\n")
##
##
## Structure of gradient_models_evaluation_data:
str(gradient_models_evaluation_data_final)
## 'data.frame': 267 obs. of 25 variables:
## $ Fill.Ounces : num -14.7 -14.7 -14.7 -14.7 -14.7 ...
## $ PC.Volume : num 0.819 0.636 0.96 0.466 0.358 ...
## $ Carb.Pressure : num -23.8 -23.8 -23.8 -23.8 -23.8 ...
## $ PSC : num 0.87313 0.14925 0.24627 0.00746 0.14179 ...
## $ PSC.Fill : num 0.817 0.575 0.317 0.539 0.699 ...
## $ PSC.CO2 : num 0.1667 0.3333 0.0833 0.0833 0.25 ...
## $ Mnf.Flow : num -0.00467 -0.00467 -0.00467 -0.00467 -0.00467 ...
## $ Carb.Pressure1 : num -2.95 -2.93 -2.92 -2.88 -2.97 ...
## $ Fill.Pressure : num 185 185 184 170 198 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 4.48 4.82 4.53 5.17 4.43 ...
## $ Temperature : num 0.19 0.159 0.159 0.857 0.222 ...
## $ Usage.cont : num 0.693 0.399 0.876 0.437 0.669 ...
## $ Carb.Flow : num 0.090152 0.089034 0.093644 -0.000437 0.098863 ...
## $ MFR : num 0.832 0.841 0.84 0.336 0.861 ...
## $ Pressure.Vacuum : num 0.895 0.72 0.783 0.841 0.841 ...
## $ Oxygen.Filler : num 0.0493 0.0694 0.1097 0.2103 0.2002 ...
## $ Bowl.Setpoint : num 0.857 0.714 0.714 0.714 0.714 ...
## $ Pressure.Setpoint: num -5.44 -5.44 -5.44 -5.44 -5.44 ...
## $ Air.Pressurer : num 0.243 0.865 0.784 0.757 0.676 ...
## $ Carb.Rel : num 0.345 0.564 0.345 0.491 0.382 ...
## $ Brand.Code_A : int 0 1 0 0 0 0 1 0 1 0 ...
## $ Brand.Code_B : int 0 0 1 1 1 1 0 1 0 0 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 1 0 0 0 0 0 0 0 0 1 ...
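Because the min-max parameters were saved in gbm_scaling_params, scaled feature values can be mapped back to their original units when needed (for example, when interpreting predictions or partial-dependence output). The helper below is a hedged sketch; it assumes the feature names in gbm_scaling_params match the scaled columns.
# Hedged sketch: invert min-max scaling using the saved parameters
# x_scaled = (x - min) / (max - min)  =>  x = x_scaled * (max - min) + min
inverse_min_max <- function(scaled_values, feature, params = gbm_scaling_params) {
  row <- params[params$feature == feature, ]
  scaled_values * (row$max - row$min) + row$min
}
# Example: recover the original Carb.Pressure values for the first few rows
head(inverse_min_max(gradient_based_data$Carb.Pressure, "Carb.Pressure"))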
For tree-based models:
- The dataset is used as is, without additional scaling of the numeric variables or transformation of the target variable PH.
- Label encoding is applied to the Brand_Code variable instead of one-hot encoding.
# Label Encoding for Tree-Based Models
tree_based_data <- train_set_transformed
tree_based_evaluation_data <- evaluation_set_transformed
# Convert Brand_Code to numeric labels
tree_based_data$Brand_Code <- as.numeric(factor(tree_based_data$Brand.Code))
tree_based_evaluation_data$Brand_Code <- as.numeric(factor(tree_based_evaluation_data$Brand.Code))
# Check structure of the final dataset
str(tree_based_data)
## 'data.frame': 2571 obs. of 34 variables:
## $ Brand.Code : Factor w/ 4 levels "A","B","C","D": 2 1 2 1 1 1 1 2 2 2 ...
## $ Carb.Volume : num 5.34 5.43 5.29 5.44 5.49 ...
## $ Fill.Ounces : num 729 732 735 732 752 ...
## $ PC.Volume : num 0.208 0.193 0.208 0.227 0.1 ...
## $ Carb.Pressure : num 2.14 2.14 2.15 2.12 2.14 ...
## $ Carb.Temp : num 0.5 0.5 0.5 0.5 0.5 ...
## $ PSC : num 0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.169 0.152 0.199 0.222 0.121 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -114 -114 -114 -114 -114 ...
## $ Carb.Pressure1 : num 21.2 21.5 21.4 20.9 21.2 ...
## $ Fill.Pressure : num 1.41 1.41 1.41 1.41 1.41 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 2.64 2.61 2.54 2.57 2.57 ...
## $ Filler.Level : num 121 119 120 118 119 ...
## $ Filler.Speed : int 4002 3986 4020 4012 4010 4014 1010 1004 4014 4028 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : num 24689 27033 24492 26121 26032 ...
## $ Density : num 0.453 0.463 0.584 0.579 0.579 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Balling : num 0.628 0.648 0.848 0.84 0.84 ...
## $ Pressure.Vacuum : num -44.9 -44.9 -39.6 -56.9 -56.9 ...
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Alch.Rel : num 6.58 6.56 7.66 7.14 7.14 7.16 6.54 6.52 6.52 6.54 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Balling.Lvl : num 0.678 0.695 0.925 0.903 0.903 ...
## $ Brand_Code : num 2 1 2 1 1 1 1 2 2 2 ...
For statistical models:
- Standardization (mean centering and scaling to unit variance) is applied to the numeric predictors; the target variable PH is excluded from standardization.
- One-hot encoding is applied to Brand_Code, with one dummy variable dropped to avoid multicollinearity.
# Install required packages if not already installed
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("fastDummies", quietly = TRUE)) {
install.packages("fastDummies")
}
# Load libraries
library(caret)
library(fastDummies)
# Set seed for reproducibility
set.seed(123)
# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")
# Step 2: Standardizing the Training Dataset
statistical_models_data <- filtered_data_final # Replace with actual training dataset
# Save standardization parameters during training
standard_scaler <- preProcess(statistical_models_data[, numeric_cols], method = c("center", "scale"))
statistical_models_data[, numeric_cols] <- predict(standard_scaler, statistical_models_data[, numeric_cols])
# Save standardization parameters for potential reverse standardization later
statistical_scaling_params <- data.frame(
feature = numeric_cols,
mean = apply(filtered_data_final[, numeric_cols], 2, mean),
sd = apply(filtered_data_final[, numeric_cols], 2, sd)
)
# Step 3: Standardizing the Evaluation Dataset
statistical_models_evaluation_data <- filtered_evaluation_final # Replace with actual evaluation dataset
# Convert PH to numeric
statistical_models_evaluation_data$PH <- as.numeric(as.character(statistical_models_evaluation_data$PH))
# Apply standardization to numeric columns in the evaluation dataset (excluding "PH")
numeric_evaluation_cols <- setdiff(names(statistical_models_evaluation_data)[sapply(statistical_models_evaluation_data, is.numeric)], "PH")
statistical_models_evaluation_data[, numeric_evaluation_cols] <- predict(standard_scaler, statistical_models_evaluation_data[, numeric_evaluation_cols])
# Step 4: One-Hot Encoding for Categorical Variables (Drop one dummy)
statistical_models_data <- dummy_cols(
statistical_models_data,
select_columns = "Brand.Code",
remove_first_dummy = TRUE, # Drop one dummy variable
remove_selected_columns = TRUE
)
statistical_models_evaluation_data <- dummy_cols(
statistical_models_evaluation_data,
select_columns = "Brand.Code",
remove_first_dummy = TRUE, # Drop one dummy variable
remove_selected_columns = TRUE
)
# Ensure PH remains numeric in the evaluation dataset
statistical_models_evaluation_data$PH <- as.numeric(statistical_models_evaluation_data$PH)
# Step 5: Reordering Columns
# Ensure the evaluation data columns match the training data
statistical_models_evaluation_data <- statistical_models_evaluation_data[, colnames(statistical_models_data), drop = FALSE]
if (identical(colnames(statistical_models_data), colnames(statistical_models_evaluation_data))) {
cat("The column ordering is identical between the training and evaluation datasets.\n")
} else {
cat("The column ordering is NOT identical. Differences:\n")
setdiff(colnames(statistical_models_data), colnames(statistical_models_evaluation_data))
}
## The column ordering is identical between the training and evaluation datasets.
# Step 6: Checking the Structure of Final Datasets
cat("\n\nStructure of statistical_models_data:\n")
##
##
## Structure of statistical_models_data:
str(statistical_models_data)
## 'data.frame': 2571 obs. of 25 variables:
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Fill.Ounces : num -0.0937 0.3642 0.9761 0.3642 3.9061 ...
## $ PC.Volume : num -0.192 -0.615 -0.192 0.299 -3.092 ...
## $ Carb.Pressure : num 0.0315 0.0874 0.7418 -1.5053 -0.2515 ...
## $ PSC : num 0.395 0.801 0.111 -0.254 -1.186 ...
## $ PSC.Fill : num 0.711 0.397 1.237 1.654 -0.157 ...
## $ PSC.CO2 : num -0.382 -0.382 2.389 -0.382 1.466 ...
## $ Mnf.Flow : num -1.05 -1.05 -1.05 -1.05 -1.05 ...
## $ Carb.Pressure1 : num -0.793 -0.197 -0.494 -1.569 -0.879 ...
## $ Fill.Pressure : num -0.573 -0.573 -0.573 -0.438 -0.64 ...
## $ Hyd.Pressure2 : num -1.33 -1.33 -1.33 -1.33 -1.33 ...
## $ Hyd.Pressure4 : num 1.531 0.772 -1.135 -0.264 -0.264 ...
## $ Temperature : num 0.0192 1.1666 0.7363 -0.2677 -0.2677 ...
## $ Usage.cont : num -1.617 -0.367 -1.086 -1.2 -1.113 ...
## $ Carb.Flow : num 0.401 0.633 0.382 0.543 0.534 ...
## $ MFR : num 0.395 0.408 0.468 0.436 0.38 ...
## $ Pressure.Vacuum : num 1.89 1.89 2.11 1.39 1.39 ...
## $ Oxygen.Filler : num -0.534 -0.449 -0.491 -0.364 -0.364 ...
## $ Bowl.Setpoint : num 0.698 0.698 0.698 0.698 0.698 ...
## $ Pressure.Setpoint: num -0.594 -0.398 -0.496 -0.791 -0.791 ...
## $ Air.Pressurer : num -0.193 0.137 -0.688 2.777 2.777 ...
## $ Carb.Rel : num -0.9058 -1.061 3.1308 -0.1295 0.0257 ...
## $ Brand.Code_B : int 1 0 1 0 0 0 0 1 1 1 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of statistical_models_evaluation_data:\n")
##
##
## Structure of statistical_models_evaluation_data:
str(statistical_models_evaluation_data)
## 'data.frame': 267 obs. of 25 variables:
## $ PH : num NA NA NA NA NA NA NA NA NA NA ...
## $ Fill.Ounces : num -120 -120 -120 -120 -120 ...
## $ PC.Volume : num 1.672 0.45 2.62 -0.689 -1.412 ...
## $ Carb.Pressure : num -155 -155 -155 -155 -155 ...
## $ PSC : num 3.071 -0.862 -0.335 -1.632 -0.902 ...
## $ PSC.Fill : num 1.52 0.379 -0.837 0.209 0.963 ...
## $ PSC.CO2 : num -0.3817 0.542 -0.8436 -0.8436 0.0801 ...
## $ Mnf.Flow : num -1.06 -1.06 -1.06 -1.06 -1.06 ...
## $ Carb.Pressure1 : num -25.3 -25.1 -25 -24.7 -25.4 ...
## $ Fill.Pressure : num 1610 1614 1606 1478 1721 ...
## $ Hyd.Pressure2 : num -1.33 -1.33 -1.33 -1.33 -1.33 ...
## $ Hyd.Pressure4 : num 29.4 32 29.7 34.7 29 ...
## $ Temperature : num 0.0192 -0.2677 -0.2677 6.0431 0.306 ...
## $ Usage.cont : num 0.224 -1.14 1.07 -0.965 0.11 ...
## $ Carb.Flow : num -1.58 -1.58 -1.56 -2.03 -1.53 ...
## $ MFR : num 0.414 0.474 0.466 -2.585 0.591 ...
## $ Pressure.Vacuum : num 1.75 0.816 1.155 1.466 1.466 ...
## $ Oxygen.Filler : num -0.5339 -0.3636 -0.0229 0.8289 0.7437 ...
## $ Bowl.Setpoint : num 1.35 0.698 0.698 0.698 0.698 ...
## $ Pressure.Setpoint: num -23.1 -23.1 -23.1 -23.1 -23.1 ...
## $ Air.Pressurer : num -0.193 3.603 3.107 2.942 2.447 ...
## $ Carb.Rel : num -0.751 1.112 -0.751 0.491 -0.44 ...
## $ Brand.Code_B : int 0 0 1 1 1 1 0 1 0 0 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 1 0 0 0 0 0 0 0 0 1 ...
For statistical models, an unstandardized (raw) version of the dataset is also prepared:
- One-hot encoding is applied to Brand_Code, with one dummy variable dropped to avoid multicollinearity.
# Install required packages if not already installed
if (!requireNamespace("fastDummies", quietly = TRUE)) {
install.packages("fastDummies")
}
# Load libraries
library(fastDummies)
# Set seed for reproducibility
set.seed(123)
# Step 1: Identify Numeric Columns (Excluding "PH")
numeric_cols <- setdiff(names(filtered_data_final)[sapply(filtered_data_final, is.numeric)], "PH")
# Step 2: Assign the Training Dataset
statistical_models_data_raw <- filtered_data_final # Replace with actual training dataset
# Step 3: Assign the Evaluation Dataset
statistical_models_evaluation_data_raw <- filtered_evaluation_final # Replace with actual evaluation dataset
# Convert PH to numeric in the training dataset
statistical_models_data_raw$PH <- as.numeric(as.character(statistical_models_data_raw$PH))
# Step 4: One-Hot Encoding for Categorical Variables (Drop one dummy)
statistical_models_data_raw <- dummy_cols(
statistical_models_data_raw,
select_columns = "Brand.Code",
remove_first_dummy = TRUE, # Drop one dummy variable
remove_selected_columns = TRUE
)
statistical_models_evaluation_data_raw <- dummy_cols(
statistical_models_evaluation_data_raw,
select_columns = "Brand.Code",
remove_first_dummy = TRUE, # Drop one dummy variable
remove_selected_columns = TRUE
)
# Step 5: Reordering Columns
# Ensure the evaluation data columns match the training data
statistical_models_evaluation_data_raw <- statistical_models_evaluation_data_raw[, colnames(statistical_models_data_raw), drop = FALSE]
if (identical(colnames(statistical_models_data_raw), colnames(statistical_models_evaluation_data_raw))) {
cat("The column ordering is identical between the training and evaluation datasets.\n")
} else {
cat("The column ordering is NOT identical. Differences:\n")
setdiff(colnames(statistical_models_data_raw), colnames(statistical_models_evaluation_data_raw))
}
## The column ordering is identical between the training and evaluation datasets.
# Step 6: Checking the Structure of Final Datasets
cat("\n\nStructure of statistical_models_data:\n")
##
##
## Structure of statistical_models_data:
str(statistical_models_data_raw)
## 'data.frame': 2571 obs. of 25 variables:
## $ PH : num 8.36 8.26 8.94 8.24 8.26 8.32 8.4 8.38 8.38 8.5 ...
## $ Fill.Ounces : num 729 732 735 732 752 ...
## $ PC.Volume : num 0.208 0.193 0.208 0.227 0.1 ...
## $ Carb.Pressure : num 2.14 2.14 2.15 2.12 2.14 ...
## $ PSC : num 0.104 0.124 0.09 0.072 0.026 0.09 0.128 0.154 0.132 0.014 ...
## $ PSC.Fill : num 0.169 0.152 0.199 0.222 0.121 ...
## $ PSC.CO2 : num 0.04 0.04 0.16 0.04 0.12 0.04 0.04 0.04 0.14 0.06 ...
## $ Mnf.Flow : num -114 -114 -114 -114 -114 ...
## $ Carb.Pressure1 : num 21.2 21.5 21.4 20.9 21.2 ...
## $ Fill.Pressure : num 1.41 1.41 1.41 1.41 1.41 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 2.64 2.61 2.54 2.57 2.57 ...
## $ Temperature : num 66 67.6 67 65.6 65.6 66.2 65.8 65.2 65.4 66.6 ...
## $ Usage.cont : num 16.2 19.9 17.8 17.4 17.7 ...
## $ Carb.Flow : num 24689 27033 24492 26121 26032 ...
## $ MFR : num 725 727 735 731 723 ...
## $ Pressure.Vacuum : num -44.9 -44.9 -39.6 -56.9 -56.9 ...
## $ Oxygen.Filler : num 0.022 0.026 0.024 0.03 0.03 0.024 0.066 0.046 0.064 0.022 ...
## $ Bowl.Setpoint : int 120 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 46.4 46.8 46.6 46 46 46 46 46 46 46 ...
## $ Air.Pressurer : num 143 143 142 146 146 ...
## $ Carb.Rel : num 5.32 5.3 5.84 5.42 5.44 5.44 5.38 5.34 5.34 5.34 ...
## $ Brand.Code_B : int 1 0 1 0 0 0 0 1 1 1 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 0 0 0 0 0 0 0 0 0 0 ...
cat("\n\nStructure of statistical_models_evaluation_data:\n")
##
##
## Structure of statistical_models_evaluation_data:
str(statistical_models_evaluation_data_raw)
## 'data.frame': 267 obs. of 25 variables:
## $ PH : logi NA NA NA NA NA NA ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.278 0.232 0.313 0.19 0.163 ...
## $ Carb.Pressure : num 0.471 0.471 0.471 0.471 0.471 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.2144 0.1509 0.0832 0.1414 0.1834 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -115 -115 -115 -115 -115 ...
## $ Carb.Pressure1 : num 9.74 9.81 9.85 9.99 9.69 ...
## $ Fill.Pressure : num 9.73 9.76 9.71 9.05 10.31 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 3.68 3.77 3.69 3.88 3.66 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : num 4624 4567.4 4801 32.9 5065.5 ...
## $ MFR : num 728 736 735 313 752 ...
## $ Pressure.Vacuum : num -48.2 -70.6 -62.5 -55 -55 ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 0.452 0.452 0.452 0.452 0.452 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Brand.Code_B : int 0 0 1 1 1 1 0 1 0 0 ...
## $ Brand.Code_C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Brand.Code_D : int 1 0 0 0 0 0 0 0 0 1 ...
Multiple Linear Regression (MLR) models will be built.
The Interaction Terms model emerged as the best model, with the lowest Test_RMSE (0.1300) and the highest Train_R2 (0.6188), indicating that incorporating interactions between variables meaningfully enhances predictive performance. The gap between its Train_R2 and Test_R2 (0.4707), however, suggests its generalization should be checked; a hedged cross-validation sketch follows the results table below.
# Load required libraries
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("Metrics", quietly = TRUE)) {
install.packages("Metrics")
}
library(caret)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Split data into training (80%) and testing (20%) sets
ols_train_indices <- createDataPartition(statistical_models_data_raw$PH, p = 0.8, list = FALSE)
ols_train_data <- statistical_models_data_raw[ols_train_indices, ]
ols_test_data <- statistical_models_data_raw[-ols_train_indices, ]
### Full Model ###
full_model <- lm(PH ~ ., data = ols_train_data)
# Null model (intercept only)
null_model <- lm(PH ~ 1, data = ols_train_data)
# Function to evaluate models
evaluate_model <- function(model, ols_train_data, ols_test_data) {
train_predictions <- predict(model, newdata = ols_train_data)
test_predictions <- predict(model, newdata = ols_test_data)
# Training Metrics
train_r2 <- summary(model)$r.squared
train_mse <- mean((ols_train_data$PH - train_predictions)^2)
train_rmse <- sqrt(train_mse)
# Testing Metrics
test_r2 <- 1 - sum((ols_test_data$PH - test_predictions)^2) / sum((ols_test_data$PH - mean(ols_test_data$PH))^2)
test_mse <- mean((ols_test_data$PH - test_predictions)^2)
test_rmse <- sqrt(test_mse)
return(list(
Train_R2 = train_r2,
Train_MSE = train_mse,
Train_RMSE = train_rmse,
Test_R2 = test_r2,
Test_MSE = test_mse,
Test_RMSE = test_rmse
))
}
### 1. Forward Selection ###
forward_model <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = 0)
forward_metrics <- evaluate_model(forward_model, ols_train_data, ols_test_data)
### 2. Backward Elimination ###
backward_model <- step(full_model,
direction = "backward",
trace = 0)
backward_metrics <- evaluate_model(backward_model, ols_train_data, ols_test_data)
### 3. Stepwise Selection ###
stepwise_model <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both",
trace = 0)
stepwise_metrics <- evaluate_model(stepwise_model, ols_train_data, ols_test_data)
### 4. Recursive Feature Elimination (RFE) ###
control <- rfeControl(functions = lmFuncs, method = "cv", number = 10)
rfe_model <- rfe(ols_train_data[, -which(names(ols_train_data) == "PH")],
ols_train_data$PH,
sizes = c(1:10),
rfeControl = control)
# Fit a model using RFE-selected predictors
selected_vars <- predictors(rfe_model)
rfe_final_model <- lm(as.formula(paste("PH ~", paste(selected_vars, collapse = " + "))), data = ols_train_data)
rfe_metrics <- evaluate_model(rfe_final_model, ols_train_data, ols_test_data)
### 5. Interaction Terms ###
interaction_model <- lm(PH ~ .^2, data = ols_train_data) # Includes all pairwise interactions
interaction_metrics <- evaluate_model(interaction_model, ols_train_data, ols_test_data)
### 6. Polynomial Features ###
polynomial_formula <- as.formula(
paste("PH ~ poly(Fill.Ounces, 2) + poly(PC.Volume, 2) + poly(Carb.Pressure, 2) + .", collapse = "+")
)
polynomial_model <- lm(polynomial_formula, data = ols_train_data)
polynomial_metrics <- evaluate_model(polynomial_model, ols_train_data, ols_test_data)
### 7. Feature Engineering ###
ols_train_data_fe <- ols_train_data %>%
mutate(
Pressure_Ratio = Fill.Pressure / Carb.Pressure,
Temperature_Deviation = abs(Temperature - mean(Temperature, na.rm = TRUE)),
PSC_Carb_Interaction = PSC * Carb.Pressure
)
ols_test_data_fe <- ols_test_data %>%
mutate(
Pressure_Ratio = Fill.Pressure / Carb.Pressure,
Temperature_Deviation = abs(Temperature - mean(ols_train_data$Temperature, na.rm = TRUE)),
PSC_Carb_Interaction = PSC * Carb.Pressure
)
fe_model <- lm(PH ~ ., data = ols_train_data_fe)
fe_metrics <- evaluate_model(fe_model, ols_train_data_fe, ols_test_data_fe)
### Combine All Metrics into a Single DataFrame ###
metrics_df <- data.frame(
Model = c(
"MLR (All Features Included)",
"Forward Selection",
"Backward Elimination",
"Stepwise Selection",
"Recursive Feature Elimination (RFE)",
"Interaction Terms",
"Polynomial Features",
"Feature Engineering"
),
Train_R2 = c(
summary(full_model)$r.squared,
forward_metrics$Train_R2,
backward_metrics$Train_R2,
stepwise_metrics$Train_R2,
rfe_metrics$Train_R2,
interaction_metrics$Train_R2,
polynomial_metrics$Train_R2,
fe_metrics$Train_R2
),
Test_R2 = c(
evaluate_model(full_model, ols_train_data, ols_test_data)$Test_R2,
forward_metrics$Test_R2,
backward_metrics$Test_R2,
stepwise_metrics$Test_R2,
rfe_metrics$Test_R2,
interaction_metrics$Test_R2,
polynomial_metrics$Test_R2,
fe_metrics$Test_R2
),
Train_RMSE = c(
evaluate_model(full_model, ols_train_data, ols_test_data)$Train_RMSE,
forward_metrics$Train_RMSE,
backward_metrics$Train_RMSE,
stepwise_metrics$Train_RMSE,
rfe_metrics$Train_RMSE,
interaction_metrics$Train_RMSE,
polynomial_metrics$Train_RMSE,
fe_metrics$Train_RMSE
),
Test_RMSE = c(
evaluate_model(full_model, ols_train_data, ols_test_data)$Test_RMSE,
forward_metrics$Test_RMSE,
backward_metrics$Test_RMSE,
stepwise_metrics$Test_RMSE,
rfe_metrics$Test_RMSE,
interaction_metrics$Test_RMSE,
polynomial_metrics$Test_RMSE,
fe_metrics$Test_RMSE
)
)
# Sort metrics_df by Test_RMSE in ascending order
metrics_df <- metrics_df %>%
arrange(Test_RMSE)
# Display the sorted DataFrame
metrics_df
## Model Train_R2 Test_R2 Train_RMSE Test_RMSE
## 1 Interaction Terms 0.6187726 0.4706976 0.1054870 0.1300389
## 2 Feature Engineering 0.4023797 0.3678177 0.1320747 0.1421158
## 3 MLR (All Features Included) 0.4016072 0.3677272 0.1321600 0.1421260
## 4 Recursive Feature Elimination (RFE) 0.4016072 0.3677272 0.1321600 0.1421260
## 5 Backward Elimination 0.4005068 0.3659733 0.1322815 0.1423230
## 6 Forward Selection 0.4005068 0.3659733 0.1322815 0.1423230
## 7 Stepwise Selection 0.4005068 0.3659733 0.1322815 0.1423230
## 8 Polynomial Features 0.4054491 0.3635935 0.1317351 0.1425899
Residual Diagnostics Statement:
- Residuals vs Fitted Plot: The residuals exhibit a random scatter around the zero line, indicating linearity. However, some slight heteroscedasticity may be present, as the spread of residuals appears to increase slightly for higher fitted values.
- Q-Q Plot: The residuals generally follow the theoretical quantiles, suggesting that the normality assumption is approximately satisfied, though some deviation is observed in the tails.
- Scale-Location Plot: The spread of standardized residuals is relatively consistent across fitted values, supporting the homoscedasticity assumption.
- Residuals vs Leverage Plot: A few high-leverage points are present, as indicated by Cook’s distance. These points may influence the model significantly and warrant further investigation.
Overall, the diagnostic plots indicate that the model assumptions are reasonably met, though some high-leverage points may require attention.
# Generate diagnostic plots for the training model
par(mfrow = c(2, 2))
plot(interaction_model)
par(mfrow = c(1, 1)) # Reset plotting layout
# summary(interaction_model)
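The Residuals vs Leverage panel flags a handful of influential observations. As a quick follow-up, the sketch below (assuming the interaction_model fit above) lists the training rows whose Cook's distance exceeds the common 4/n rule of thumb, so they can be examined before deciding whether a refit without them is warranted.
# Flag potentially influential observations in the interaction model
# (4/n is a common, though not universal, Cook's distance cutoff)
cooks_d <- cooks.distance(interaction_model)
cutoff <- 4 / length(cooks_d)
influential_idx <- which(cooks_d > cutoff)
# Show the ten most influential rows together with their leverage values
influential_summary <- data.frame(
  Row = influential_idx,
  Cooks_D = cooks_d[influential_idx],
  Leverage = hatvalues(interaction_model)[influential_idx]
)
head(influential_summary[order(-influential_summary$Cooks_D), ], 10)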
The predictor matrices (train_X, test_X) and the response vectors (train_y, test_y) were prepared for the regularized and dimension-reduction models that follow. Based on the sorted DataFrame output, Lasso Regression is the best-performing model in this group, having the lowest Test RMSE (\(0.1421\)) and competitive \(R^2\) values. This suggests that it balances bias and variance effectively for this dataset.
# Install required packages
if (!requireNamespace("pls", quietly = TRUE)) {
install.packages("pls")
}
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("glmnet", quietly = TRUE)) {
install.packages("glmnet")
}
# Load required libraries
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(caret)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
# Prepare predictors (X) and response (y)
train_X <- as.matrix(ols_train_data[, -which(names(ols_train_data) == "PH")])
train_y <- ols_train_data$PH
test_X <- as.matrix(ols_test_data[, -which(names(ols_test_data) == "PH")])
test_y <- ols_test_data$PH
### 1. Partial Least Squares Regression (PLSR)
# Fit PLS model with cross-validation
pls_model <- plsr(PH ~ ., data = ols_train_data, validation = "CV")
# Determine the optimal number of components
optimal_ncomp <- selectNcomp(pls_model, method = "onesigma", plot = FALSE)
# Predict on the training and test sets using the optimal number of components
pls_train_predictions <- predict(pls_model, newdata = train_X, ncomp = optimal_ncomp)
pls_test_predictions <- predict(pls_model, newdata = test_X, ncomp = optimal_ncomp)
# Evaluate the PLS model
pls_train_rmse <- RMSE(pls_train_predictions, train_y)
pls_test_rmse <- RMSE(pls_test_predictions, test_y)
pls_train_r2 <- cor(pls_train_predictions, train_y)^2
pls_test_r2 <- cor(pls_test_predictions, test_y)^2
### 2. Principal Component Regression (PCR)
# Fit PCR model with cross-validation
pcr_model <- pcr(PH ~ ., data = ols_train_data, validation = "CV")
# Determine the optimal number of components
optimal_ncomp_pcr <- selectNcomp(pcr_model, method = "onesigma", plot = FALSE)
# Predict on the training and test sets using the optimal number of components
pcr_train_predictions <- predict(pcr_model, newdata = train_X, ncomp = optimal_ncomp_pcr)
pcr_test_predictions <- predict(pcr_model, newdata = test_X, ncomp = optimal_ncomp_pcr)
# Evaluate the PCR model
pcr_train_rmse <- RMSE(pcr_train_predictions, train_y)
pcr_test_rmse <- RMSE(pcr_test_predictions, test_y)
pcr_train_r2 <- cor(pcr_train_predictions, train_y)^2
pcr_test_r2 <- cor(pcr_test_predictions, test_y)^2
### 3. Ridge Regression
ridge_model <- glmnet(train_X, train_y, alpha = 0) # alpha = 0 for Ridge
cv_ridge <- cv.glmnet(train_X, train_y, alpha = 0) # Cross-validation for Ridge
ridge_lambda <- cv_ridge$lambda.min # Optimal lambda
# Predict on the training and test sets
ridge_train_predictions <- predict(cv_ridge, newx = train_X, s = ridge_lambda)
ridge_test_predictions <- predict(cv_ridge, newx = test_X, s = ridge_lambda)
# Evaluate Ridge model
ridge_train_rmse <- RMSE(ridge_train_predictions, train_y)
ridge_test_rmse <- RMSE(ridge_test_predictions, test_y)
ridge_train_r2 <- cor(ridge_train_predictions, train_y)^2
ridge_test_r2 <- cor(ridge_test_predictions, test_y)^2
### 4. Lasso Regression
lasso_model <- glmnet(train_X, train_y, alpha = 1) # alpha = 1 for Lasso
cv_lasso <- cv.glmnet(train_X, train_y, alpha = 1) # Cross-validation for Lasso
lasso_lambda <- cv_lasso$lambda.min # Optimal lambda
# Predict on the training and test sets
lasso_train_predictions <- predict(cv_lasso, newx = train_X, s = lasso_lambda)
lasso_test_predictions <- predict(cv_lasso, newx = test_X, s = lasso_lambda)
# Evaluate Lasso model
lasso_train_rmse <- RMSE(lasso_train_predictions, train_y)
lasso_test_rmse <- RMSE(lasso_test_predictions, test_y)
lasso_train_r2 <- cor(lasso_train_predictions, train_y)^2
lasso_test_r2 <- cor(lasso_test_predictions, test_y)^2
### 5. Elastic Net
elastic_net_model <- train(
x = train_X, y = train_y,
method = "glmnet",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10),
preProc = c("center", "scale")
)
# Best Elastic Net model
best_enet <- elastic_net_model$finalModel
best_enet_lambda <- elastic_net_model$bestTune$lambda
best_enet_alpha <- elastic_net_model$bestTune$alpha
# Predict on the training and test sets
enet_train_predictions <- predict(best_enet, newx = train_X, s = best_enet_lambda)
enet_test_predictions <- predict(best_enet, newx = test_X, s = best_enet_lambda)
# Evaluate Elastic Net model
enet_train_rmse <- RMSE(enet_train_predictions, train_y)
enet_test_rmse <- RMSE(enet_test_predictions, test_y)
enet_train_r2 <- cor(enet_train_predictions, train_y)^2
enet_test_r2 <- cor(enet_test_predictions, test_y)^2
### Combine Results into a Data Frame
results_df <- data.frame(
Model = c("PLS", "PCR", "Ridge Regression", "Lasso Regression", "Elastic Net"),
Train_R2 = c(pls_train_r2, pcr_train_r2, ridge_train_r2, lasso_train_r2, enet_train_r2),
Test_R2 = c(pls_test_r2, pcr_test_r2, ridge_test_r2, lasso_test_r2, enet_test_r2),
Train_RMSE = c(pls_train_rmse, pcr_train_rmse, ridge_train_rmse, lasso_train_rmse, enet_train_rmse),
Test_RMSE = c(pls_test_rmse, pcr_test_rmse, ridge_test_rmse, lasso_test_rmse, enet_test_rmse)
)
# Sort metrics_df by Test_RMSE in ascending order
results_df <- results_df %>%
arrange(Test_RMSE)
# Display the sorted DataFrame
results_df %>%
kable(
caption = "Summary of Partial Least Squares Models Performance"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE # Set to FALSE for custom widths
) %>%
column_spec(1, width = "5cm") %>% # Adjust the width of the first column
column_spec(2:5, width = "3cm") # Adjust the width of columns 2 to 5
Model | Train_R2 | Test_R2 | Train_RMSE | Test_RMSE |
---|---|---|---|---|
Lasso Regression | 0.4015648 | 0.3683422 | 0.1321666 | 0.1421016 |
Ridge Regression | 0.4003721 | 0.3695003 | 0.1323733 | 0.1421150 |
PCR | 0.3889516 | 0.3659666 | 0.1335502 | 0.1423654 |
PLS | 0.3864426 | 0.3606606 | 0.1338241 | 0.1429587 |
Elastic Net | 0.0442056 | 0.0235254 | 194.6793992 | 194.0983447 |
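The Elastic Net row is an obvious outlier: its RMSE near 194 is on a completely different scale from the other models. A plausible explanation, an assumption rather than something verified here, is that the predictions were generated from the finalModel object with the raw predictor matrix, which bypasses the centering and scaling caret applied during training. Predicting through the train object applies the same preprocessing automatically; a minimal sketch (the *_check objects are new names introduced only for this comparison):
# Predict through the caret train object so preProc (center/scale) is applied
enet_train_pred_check <- predict(elastic_net_model, newdata = train_X)
enet_test_pred_check <- predict(elastic_net_model, newdata = test_X)
# Recompute the Elastic Net metrics on the same scale as the other models
enet_test_rmse_check <- RMSE(enet_test_pred_check, test_y)
enet_test_r2_check <- cor(enet_test_pred_check, test_y)^2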
The diagnostic plots and the cross-validation summary for the Lasso model are generated below; the residuals are computed on the held-out test set.
# Generate diagnostic plots for the Lasso model
par(mfrow = c(2, 2)) # Set layout for 2x2 plots
# Residuals vs. Fitted
plot(
as.numeric(lasso_test_predictions),
as.numeric(test_y - lasso_test_predictions),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs Fitted",
pch = 20,
col = "black"
)
abline(h = 0, col = "red", lty = 2)
# Q-Q Plot
qqnorm(as.numeric(test_y - lasso_test_predictions),
main = "Normal Q-Q Plot",
pch = 20,
col = "black")
qqline(as.numeric(test_y - lasso_test_predictions), col = "red")
# Scale-Location Plot
std_residuals <- scale(as.numeric(test_y - lasso_test_predictions))
plot(
as.numeric(lasso_test_predictions),
sqrt(abs(std_residuals)),
xlab = "Fitted Values",
ylab = "√|Standardized Residuals|",
main = "Scale-Location",
pch = 20,
col = "black"
)
abline(h = 0, col = "red", lty = 2)
# Residuals vs Leverage Plot (approximation for test data)
hat_values_test <- diag(test_X %*% solve(t(train_X) %*% train_X) %*% t(test_X)) # Approx. Hat matrix for test
plot(
hat_values_test,
std_residuals,
xlab = "Leverage",
ylab = "Standardized Residuals",
main = "Residuals vs Leverage",
pch = 20,
col = "black"
)
abline(h = 0, col = "red", lty = 2)
par(mfrow = c(1, 1)) # Reset plotting layout
# Print summary of the Lasso model
print(cv_lasso)
##
## Call: cv.glmnet(x = train_X, y = train_y, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.000201 65 0.01792 0.000795 24
## 1se 0.005715 29 0.01866 0.000794 15
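The cross-validation summary reports two candidate penalties: lambda.min (lowest mean CV error, 24 nonzero coefficients) and lambda.1se (within one standard error of the minimum, only 15 nonzero coefficients). If a sparser model is preferred, the fitted cv_lasso object can be reused at the 1-SE penalty without refitting; a brief sketch under that assumption:
# Evaluate the sparser lambda.1se solution on the held-out test set
lasso_1se_predictions <- predict(cv_lasso, newx = test_X, s = "lambda.1se")
lasso_1se_rmse <- RMSE(lasso_1se_predictions, test_y)
# Coefficients retained at lambda.1se (nonzero entries only)
coef_1se <- as.matrix(coef(cv_lasso, s = "lambda.1se"))
coef_1se[coef_1se[, 1] != 0, , drop = FALSE]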
# Extract coefficients for Lasso model
lasso_coefficients <- coef(cv_lasso, s = lasso_lambda)
# Create a data frame for variable importance (absolute coefficients)
importance_df_lasso <- data.frame(
Variable = rownames(lasso_coefficients),
Importance = abs(as.numeric(lasso_coefficients))
)
# Remove intercept from the importance calculation
importance_df_lasso <- importance_df_lasso[importance_df_lasso$Variable != "(Intercept)", ]
# Sort by importance
importance_df_lasso <- importance_df_lasso[order(-importance_df_lasso$Importance), ]
# Plot variable importance for Lasso
library(ggplot2)
ggplot(importance_df_lasso, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_point() +
coord_flip() +
labs(title = "Lasso Variable Importance", x = "Variables", y = "Importance") +
theme_minimal()
For Lasso (alpha = 1) or Elastic Net (0 < alpha < 1), use the same cross-validation approach with glmnet.
# Lasso Regression
cv_lasso <- cv.glmnet(train_X, train_y, alpha = 1)
# Elastic Net
cv_enet <- train(
x = train_X, y = train_y,
method = "glmnet",
tuneGrid = expand.grid(
alpha = seq(0, 1, by = 0.1),
lambda = seq(0.001, 0.1, length.out = 10)
),
trControl = trainControl(method = "cv")
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Extract Best Model and Plot
plot(cv_lasso, main = "Cross-Validation Profile for Lasso Regression")
# Install required packages
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("earth", quietly = TRUE)) {
install.packages("earth")
}
if (!requireNamespace("e1071", quietly = TRUE)) {
install.packages("e1071")
}
if (!requireNamespace("nnet", quietly = TRUE)) {
install.packages("nnet")
}
# Load libraries
library(caret)
library(earth) # For MARS
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(e1071) # For SVR
library(nnet) # For ANN
# Set seed for reproducibility
set.seed(123)
# Data preparation
# Use the statistical_models_data prepared earlier for the non-linear models
data <- statistical_models_data
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH
# Function to compute AIC and BIC
compute_aic_bic <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y)
k <- length(coef(model))
mse <- mean(residuals^2)
loglik <- -n / 2 * log(2 * pi * mse) - n / 2
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k
return(c(AIC = aic, BIC = bic))
}
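The function computes a Gaussian log-likelihood from the mean squared error of the supplied data and penalizes it by the number of coefficients returned by coef(); for learners that expose no coefficients (such as KNN), that count is zero, so AIC and BIC coincide. As an illustrative sanity check of the likelihood formula (not part of the original analysis), it can be applied to a plain linear model on the training split, where it should differ from stats::AIC by 2, the penalty for the residual-variance parameter that stats::AIC also counts, provided the training data contain no missing values. The lm_check object below is introduced only for this comparison.
# Sanity check of the AIC/BIC formula against stats::AIC on a simple linear model
lm_check <- lm(PH ~ ., data = train_data)
compute_aic_bic(lm_check, train_data, train_data$PH)
# stats::AIC additionally counts the residual variance as a parameter,
# so its value is larger by 2 than the AIC returned above
AIC(lm_check)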
The optimal K for the KNN model is 5, achieving a Test RMSE of 0.1287 and Test R² of 0.4893. The cross-validation RMSE profile reaches its minimum at K = 5 and is higher for larger values of K, confirming this as the best configuration. The reported AIC and BIC are both -647.91; they coincide because coef() returns no parameters for a KNN fit, so the complexity penalty contributes nothing.
# Load necessary libraries
library(ggplot2)
### 1. K-Nearest Neighbors (KNN)
knn_model <- train(
x = train_X, y = train_y,
method = "knn",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10) # Removed preProc
)
# Generate predictions on the test set
knn_predictions <- predict(knn_model, test_X)
# Calculate RMSE
knn_rmse <- RMSE(knn_predictions, test_y)
# Calculate R-squared
knn_r2 <- cor(knn_predictions, test_y)^2
# Compute AIC and BIC
knn_aic_bic <- compute_aic_bic(knn_model, test_X, test_y)
# Extract the optimal value of K
optimal_k <- knn_model$bestTune$k
# Print the results
cat("KNN Model Results:\n")
## KNN Model Results:
cat("Optimal K:", optimal_k, "\n")
## Optimal K: 5
cat("Test RMSE:", knn_rmse, "\n")
## Test RMSE: 0.1286795
cat("Test R^2:", knn_r2, "\n")
## Test R^2: 0.4892592
cat("AIC:", knn_aic_bic["AIC"], "\n")
## AIC: -647.9108
cat("BIC:", knn_aic_bic["BIC"], "\n")
## BIC: -647.9108
# Extract the results from the KNN model
knn_results_df <- knn_model$results
# Create a plot of RMSE vs k for the KNN model
ggplot(knn_results_df, aes(x = k, y = RMSE)) +
geom_line() +
geom_point(size = 2, color = "blue") +
labs(
title = "RMSE Profiles for the K-Nearest Neighbors Model",
x = "Number of Neighbors (k)",
y = "RMSE (Cross-Validation)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
### Residual analysis of the KNN Model
This residual analysis evaluates the performance and assumptions of the KNN model by examining residual patterns and prediction accuracy. It includes a residuals-vs-predicted plot, a histogram of residuals, a predicted-vs-actual scatter plot, and a QQ plot of residuals.
The combined analysis provides insight into the KNN model’s predictive reliability and adherence to assumptions.
# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
install.packages("ggplotify")
}
# Load necessary libraries
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Prepare the plots
# 1. Residuals vs. Predicted Values
knn_residuals <- test_y - knn_predictions
plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = knn_predictions, Residuals = knn_residuals),
aes(x = Predicted, y = Residuals)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residuals vs. Predicted Values",
x = "Predicted Values",
y = "Residuals"
) +
theme_minimal()
# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = knn_residuals), aes(x = Residuals)) +
geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
labs(
title = "Distribution of Residuals",
x = "Residuals",
y = "Frequency"
) +
theme_minimal()
# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = knn_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
geom_point(color = "black", alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
labs(
title = "KNN Predictions vs. Actual Values",
x = "Actual Values",
y = "Predicted Values"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display the QQ Plot
qq_plot <- function() {
qqnorm(knn_residuals, main = "QQ Plot of Residuals")
qqline(knn_residuals, col = "red")
}
# Arrange the plots in a grid
grid.arrange(
plot_residuals_vs_predicted,
plot_residuals_hist,
plot_predicted_vs_actual,
ggplotify::as.ggplot(qq_plot), # Convert the base R plot into a ggplot-compatible object
ncol = 2
)
The Support Vector Regression (SVR) model was tuned with a radial basis kernel and delivered the strongest performance among the non-linear models, with optimal parameters sigma ≈ 0.0267 and C = 4.
The Test RMSE achieved was 0.1217, and the Test \(R^2\) was 0.5382, indicating strong predictive performance. The validation curve shows the cross-validated RMSE decreasing as \(C\) approaches the optimal value and not improving beyond it, supporting the selected configuration. Statistical evaluation metrics (AIC, AICc, and BIC, computed with the number of support vectors as a complexity proxy) are also reported below.
# Load necessary libraries
library(ggplot2)
### 2. Support Vector Regression (SVR)
svr_model <- train(
x = train_X, y = train_y,
method = "svmRadial",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10) # Removed preProc
)
# Generate predictions on the test set
svr_predictions <- predict(svr_model, test_X)
# Calculate performance metrics
svr_rmse <- RMSE(svr_predictions, test_y)
svr_r2 <- cor(svr_predictions, test_y)^2
# AIC/BIC computation function for SVR
compute_aic_bic_svr <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
mse <- mean(residuals^2) # Mean squared error
# Use the number of support vectors as the number of parameters (proxy)
k <- length(model$finalModel@alpha) # Number of support vectors
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC; not meaningful when k >= n - 1 (e.g., when support vectors outnumber test observations)
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
# Compute AIC, AICc, and BIC for the SVR model
svr_aic_bic <- compute_aic_bic_svr(svr_model, test_X, test_y)
# Extract optimal parameters from the model
optimal_params <- svr_model$bestTune # Best hyperparameters for SVR
# Display SVR results
cat("SVR Results:\n")
## SVR Results:
cat("Optimal Parameters:\n")
## Optimal Parameters:
print(optimal_params)
## sigma C
## 5 0.02667853 4
cat("Test RMSE:", svr_rmse, "\n")
## Test RMSE: 0.1216659
cat("Test R^2:", svr_r2, "\n")
## Test R^2: 0.53815
cat("AIC:", svr_aic_bic["AIC"], "\n")
## AIC: 2794.586
cat("AICc:", svr_aic_bic["AICc"], "\n")
## AICc: -2155.737
cat("BIC:", svr_aic_bic["BIC"], "\n")
## BIC: 10215.07
# Extract the results from the SVR model
results_df <- svr_model$results
# Create a plot of RMSE vs Cost (C) for the SVR model
ggplot(results_df, aes(x = C, y = RMSE, color = as.factor(sigma))) +
geom_line() +
geom_point(size = 2) +
labs(
title = "RMSE Profiles for the Support Vector Regression Model",
x = "Cost (C)",
y = "RMSE (Cross-Validation)",
color = "Sigma (σ)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "right"
)
Residual analysis of the Support Vector Regression Model
Residual analysis evaluates the performance and assumptions of the fitted SVR model using a residuals-vs-predicted plot, a histogram of residuals, a predicted-vs-actual scatter plot, and a QQ plot of residuals.
These diagnostics highlight areas where further tuning or feature engineering could improve the SVR model.
# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
install.packages("ggplotify")
}
# Load necessary libraries
library(ggplot2)
library(gridExtra)
# Calculate residuals
svr_residuals <- test_y - svr_predictions
plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = svr_predictions, Residuals = svr_residuals),
aes(x = Predicted, y = Residuals)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residuals vs. Predicted Values",
x = "Predicted Values",
y = "Residuals"
) +
theme_minimal()
# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = svr_residuals), aes(x = Residuals)) +
geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
labs(
title = "Distribution of Residuals",
x = "Residuals",
y = "Frequency"
) +
theme_minimal()
# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = svr_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
geom_point(color = "black", alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
labs(
title = "SVR Predictions vs. Actual Values",
x = "Actual Values",
y = "Predicted Values"
) +
theme_minimal()
# Display the QQ Plot
qq_plot <- function() {
qqnorm(svr_residuals, main = "QQ Plot of Residuals")
qqline(svr_residuals, col = "red")
}
# Arrange the plots in a grid
grid.arrange(
plot_residuals_vs_predicted,
plot_residuals_hist,
plot_predicted_vs_actual,
ggplotify::as.ggplot(qq_plot),
ncol = 2
)
With the weight decay fixed at 0.0178, the optimal number of hidden units for the ANN is 9, which minimizes the cross-validated RMSE. The final ANN model achieved a Train RMSE of 0.1249, a Test RMSE of 0.1276, a Train R² of 0.481, and a Test R² of 0.497.
This configuration balances accuracy and generalization effectively.
### Artificial Neural Network (ANN) Model with Validation Curve
library(caret)
library(nnet)
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Define a grid for number of hidden units (size) using the optimal decay = 0.01778279
validation_grid <- expand.grid(size = seq(2, 18, by = 1), decay = 0.01778279)
# Train ANN model with the validation grid
ann_model <- train(
x = train_X, y = train_y,
method = "nnet",
tuneGrid = validation_grid, # Test varying hidden units
trControl = trainControl(method = "cv", number = 10), # 10-fold cross-validation
linout = TRUE, # Linear output for regression tasks
trace = FALSE
)
# Generate predictions on the test set
ann_predictions <- predict(ann_model, test_X)
# Calculate performance metrics
ann_rmse <- RMSE(ann_predictions, test_y)
ann_r2 <- cor(ann_predictions, test_y)^2
# Compute AIC, AICc, and BIC for ANN
compute_aic_bic <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
k <- length(coef(model$finalModel)) # Number of parameters in the final model
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
# Compute AIC, AICc, and BIC
ann_aic_bic <- compute_aic_bic(ann_model, test_X, test_y)
# Extract the optimal configuration from ann_model$results
optimal_config <- ann_model$results[which.min(ann_model$results$RMSE), ]
# Extract the optimal number of hidden units
best_hidden_units <- optimal_config$size
# Display results
cat("ANN Results:\n")
## ANN Results:
cat("Optimal Number of Hidden Units (Programmatically Extracted):", best_hidden_units, "\n")
## Optimal Number of Hidden Units (Programmatically Extracted): 9
cat("Train RMSE:", ann_model$results$RMSE[which.min(ann_model$results$RMSE)], "\n")
## Train RMSE: 0.1248814
cat("Test RMSE:", ann_rmse, "\n")
## Test RMSE: 0.1276438
cat("Train R^2:", ann_model$results$Rsquared[which.min(ann_model$results$RMSE)], "\n")
## Train R^2: 0.4810389
cat("Test R^2:", ann_r2, "\n")
## Test R^2: 0.4972056
cat("AIC:", ann_aic_bic["AIC"], "\n")
## AIC: -186.2019
cat("AICc:", ann_aic_bic["AICc"], "\n")
## AICc: 214.2313
cat("BIC:", ann_aic_bic["BIC"], "\n")
## BIC: 810.263
# Extract the results from the model
results_df <- ann_model$results
# Plot RMSE vs. Number of Hidden Units
ggplot(results_df, aes(x = size, y = RMSE)) +
geom_line(color = "blue") +
geom_point(size = 2, color = "red") +
labs(
title = "Validation Curve: RMSE vs. # Hidden Units (ANN)",
x = "# Hidden Units",
y = "RMSE (Cross-Validation)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Residual analysis evaluates the performance and assumptions of the fitted Artificial Neural Network (ANN) model using the same diagnostics as before: residuals vs. predicted values, a histogram of residuals, predicted vs. actual values, and a QQ plot of residuals.
These diagnostics highlight areas where further tuning or feature engineering could improve the ANN model.
# Install required packages
if (!requireNamespace("ggplotify", quietly = TRUE)) {
install.packages("ggplotify")
}
# Load necessary libraries
library(ggplot2)
library(gridExtra)
# Prepare the plots
# 1. Residuals vs. Predicted Values
ann_residuals <- test_y - ann_predictions
plot_residuals_vs_predicted <- ggplot(data.frame(Predicted = ann_predictions, Residuals = ann_residuals),
aes(x = Predicted, y = Residuals)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Residuals vs. Predicted Values",
x = "Predicted Values",
y = "Residuals"
) +
theme_minimal()
# 2. Histogram of Residuals
plot_residuals_hist <- ggplot(data.frame(Residuals = ann_residuals), aes(x = Residuals)) +
geom_histogram(bins = 30, color = "black", fill = "black", alpha = 0.7) +
labs(
title = "Distribution of Residuals",
x = "Residuals",
y = "Frequency"
) +
theme_minimal()
# 3. Scatter Plot of Predicted vs. Actual Values
results <- data.frame(Actual = test_y, Predicted = ann_predictions)
plot_predicted_vs_actual <- ggplot(data = results, aes(x = Actual, y = Predicted)) +
geom_point(color = "black", alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red", size = 1) +
labs(
title = "ANN Predictions vs. Actual Values",
x = "Actual Values",
y = "Predicted Values"
) +
theme_minimal()
# Display the QQ Plot
qq_plot <- function() {
qqnorm(ann_residuals, main = "QQ Plot of Residuals")
qqline(ann_residuals, col = "red")
}
# Arrange the plots in a grid
grid.arrange(
plot_residuals_vs_predicted,
plot_residuals_hist,
plot_predicted_vs_actual,
ggplotify::as.ggplot(qq_plot),
ncol = 2
)
# Load necessary libraries
library(ggplot2)
### 4. Multivariate Adaptive Regression Splines (MARS)
# Assuming tree_based_data is used for MARS
data <- tree_based_data # Switch to untransformed data for MARS
# Reuse the 80/20 partition created earlier (train_indices) so results are comparable across models
mars_train_data <- data[train_indices, ]
mars_test_data <- data[-train_indices, ]
mars_train_X <- mars_train_data[, setdiff(names(mars_train_data), "PH")]
mars_train_y <- mars_train_data$PH
mars_test_X <- mars_test_data[, setdiff(names(mars_test_data), "PH")]
mars_test_y <- mars_test_data$PH
mars_model <- train(
x = mars_train_X, y = mars_train_y,
method = "earth",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10)
)
# Generate predictions on the test set
mars_predictions <- predict(mars_model, mars_test_X)
# Calculate performance metrics
mars_rmse <- RMSE(mars_predictions, mars_test_y)
mars_r2 <- cor(mars_predictions, mars_test_y)^2
# Compute AIC and BIC for MARS
mars_aic_bic <- compute_aic_bic(mars_model, mars_test_X, mars_test_y)
# Extract the optimal configuration from mars_model$results
optimal_config <- mars_model$results[which.min(mars_model$results$RMSE), ]
# Extract the optimal number of basis functions (if applicable)
best_basis_functions <- optimal_config$nprune
# Display results
cat("MARS Model Results:\n")
## MARS Model Results:
cat("Optimal Number of Basis Functions:", best_basis_functions, "\n")
## Optimal Number of Basis Functions: 18
cat("Train RMSE:", mars_model$results$RMSE[which.min(mars_model$results$RMSE)], "\n")
## Train RMSE: 0.1288138
cat("Test RMSE:", mars_rmse, "\n")
## Test RMSE: 0.132735
cat("Train R^2:", mars_model$results$Rsquared[which.min(mars_model$results$RMSE)], "\n")
## Train R^2: 0.4347175
cat("Test R^2:", mars_r2, "\n")
## Test R^2: 0.4500877
cat("AIC:", mars_aic_bic["AIC"], "\n")
## AIC: -584.0745
cat("AICc:", mars_aic_bic["AICc"], "\n")
## AICc: -582.9777
cat("BIC:", mars_aic_bic["BIC"], "\n")
## BIC: -516.2301
# Extract the results from the MARS model
mars_results_df <- mars_model$results
# Create a plot of RMSE vs. nprune for each degree
ggplot(mars_results_df, aes(x = nprune, y = RMSE, color = factor(degree))) +
geom_line() +
geom_point(size = 2) +
labs(
title = "RMSE Profiles for the MARS Model",
x = "Number of Pruned Terms (nprune)",
y = "RMSE (Cross-Validation)",
color = "Degree"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5)
)
# Combine results into a data frame
results_df <- data.frame(
Model = c("KNN", "MARS", "SVR", "ANN"),
Train_RMSE = c(knn_model$results$RMSE[which.min(knn_model$results$RMSE)],
mars_model$results$RMSE[which.min(mars_model$results$RMSE)],
svr_model$results$RMSE[which.min(svr_model$results$RMSE)],
ann_model$results$RMSE[which.min(ann_model$results$RMSE)]
),
Test_RMSE = c(knn_rmse, mars_rmse, svr_rmse, ann_rmse),
Train_R2 = c(knn_model$results$Rsquared[which.min(knn_model$results$RMSE)],
mars_model$results$Rsquared[which.min(mars_model$results$RMSE)],
svr_model$results$Rsquared[which.min(svr_model$results$RMSE)],
ann_model$results$Rsquared[which.min(ann_model$results$RMSE)]
),
Test_R2 = c(knn_r2, mars_r2, svr_r2, ann_r2),
AIC = c(knn_aic_bic["AIC"], mars_aic_bic["AIC"], svr_aic_bic["AIC"], ann_aic_bic["AIC"]),
BIC = c(knn_aic_bic["BIC"], mars_aic_bic["BIC"], svr_aic_bic["BIC"], ann_aic_bic["BIC"])
)
# Sort by Test_RMSE
results_df <- results_df[order(results_df$Test_RMSE), ]
# Display results
results_df %>%
kable(
caption = "Summary of Non-linear Models Performance"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE # Set to FALSE for custom widths
) %>%
column_spec(1, width = "5cm") %>% # Adjust the width of the first column
column_spec(2:5, width = "3cm") # Adjust the width of columns 2 to 5
Model | Train_RMSE | Test_RMSE | Train_R2 | Test_R2 | AIC | BIC |
---|---|---|---|---|---|---|
SVR | 0.1184443 | 0.1216659 | 0.5242925 | 0.5381500 | 2794.5859 | 10215.0686 |
ANN | 0.1248814 | 0.1276438 | 0.4810389 | 0.4972056 | -186.2019 | 810.2630 |
KNN | 0.1258932 | 0.1286795 | 0.4680361 | 0.4892592 | -647.9108 | -647.9108 |
MARS | 0.1288138 | 0.1327350 | 0.4347175 | 0.4500877 | -584.0745 | -516.2301 |
# Install required packages
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("ipred", quietly = TRUE)) {
install.packages("ipred")
}
if (!requireNamespace("randomForest", quietly = TRUE)) {
install.packages("randomForest")
}
if (!requireNamespace("gbm", quietly = TRUE)) {
install.packages("gbm")
}
if (!requireNamespace("xgboost", quietly = TRUE)) {
install.packages("xgboost")
}
# Install and load doParallel
if (!requireNamespace("doParallel", quietly = TRUE)) {
install.packages("doParallel")
}
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
# Load libraries
library(caret) # For model training and cross-validation
library(ipred) # For Bagged Trees
library(randomForest) # For Random Forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(gbm) # For Gradient Boosting Machine
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(xgboost) # For Extreme Gradient Boosting
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# Set seed for reproducibility
set.seed(123)
# Data preparation
# Using tree_based_data for modeling
data <- tree_based_data
# Split the data into training and testing sets (80% training, 20% testing)
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Separate predictors (X) and response (y)
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH
# Function to compute AIC, BIC, and AICc
compute_aic_bic <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Sample size
k <- length(coef(model)) # Number of parameters in the model
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
# Compute AIC
aic <- -2 * loglik + 2 * k
# Compute BIC
bic <- -2 * loglik + log(n) * k
# Compute AICc (corrected AIC)
if (n - k - 1 > 0) {
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
} else {
aicc <- NA # AICc not defined if n - k - 1 <= 0
}
return(c(AIC = aic, BIC = bic, AICc = aicc))
}
#### Bagged Trees Model
if (!requireNamespace("rpart", quietly = TRUE)) {
install.packages("rpart")
}
library(ipred)
library(rpart)
library(ggplot2)
# Define a grid of parameters to tune
tune_grid <- expand.grid(
minsplit = c(2, 5, 10), # Minimum samples required for a split
maxdepth = c(5, 10, 15), # Maximum depth of trees
nbagg = c(10, 20, 30) # Number of bootstrap trees
)
# Placeholder for results
results <- data.frame()
# Loop through parameter combinations
for (i in 1:nrow(tune_grid)) {
set.seed(123)
# Extract current parameter values
params <- tune_grid[i, ]
# Train Bagged Trees with specified parameters
bagged_model <- bagging(
train_y ~ ., data = data.frame(train_X, train_y),
nbagg = params$nbagg,
control = rpart.control(minsplit = params$minsplit, maxdepth = params$maxdepth)
)
# Predict on the test set
predictions <- predict(bagged_model, newdata = test_X)
bagged_rmse <- RMSE(predictions, test_y)
bagged_r2 <- cor(predictions, test_y)^2
# Store results
results <- rbind(
results,
data.frame(
minsplit = params$minsplit,
maxdepth = params$maxdepth,
nbagg = params$nbagg,
RMSE = bagged_rmse,
R2 = bagged_r2
)
)
}
# Identify the best configuration
best_config <- results[which.min(results$RMSE), ]
cat("Best Configuration:\n")
## Best Configuration:
print(best_config)
## minsplit maxdepth nbagg RMSE R2
## 13 2 10 20 0.1248368 0.5216001
# Compute AIC and BIC for the best bagged model
compute_aic_bic_bagged <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
k <- length(model$mtrees) # Proxy for number of parameters (number of trees)
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
# Train the best bagged model based on the optimal configuration
final_bagged_model <- bagging(
train_y ~ ., data = data.frame(train_X, train_y),
nbagg = best_config$nbagg,
control = rpart.control(minsplit = best_config$minsplit, maxdepth = best_config$maxdepth)
)
# Compute AIC and BIC for the final model
bagged_aic_bic <- compute_aic_bic_bagged(final_bagged_model, test_X, test_y)
# Display the best configuration and metrics
cat("\nOptimal Bagged Trees Model Configuration:\n")
##
## Optimal Bagged Trees Model Configuration:
cat("Min Split:", best_config$minsplit, "\n")
## Min Split: 2
cat("Max Depth:", best_config$maxdepth, "\n")
## Max Depth: 10
cat("Number of Bootstrap Trees:", best_config$nbagg, "\n")
## Number of Bootstrap Trees: 20
cat("Optimal RMSE:", best_config$RMSE, "\n")
## Optimal RMSE: 0.1248368
cat("Optimal R2:", best_config$R2, "\n")
## Optimal R2: 0.5216001
cat("AIC:", bagged_aic_bic["AIC"], "\n")
## AIC: -630.8485
cat("AICc:", bagged_aic_bic["AICc"], "\n")
## AICc: -629.1411
cat("BIC:", bagged_aic_bic["BIC"], "\n")
## BIC: -546.0429
# Enhanced plot with proper grouping, facetting, and distinct colors
ggplot(results, aes(
x = nbagg,
y = RMSE,
group = interaction(minsplit, maxdepth), # Group by minsplit and maxdepth
color = as.factor(minsplit) # Apply color to minsplit
)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
facet_wrap(~ maxdepth, labeller = label_both) +
scale_color_manual(
values = c("2" = "red", "5" = "green", "10" = "blue") # Manually assign colors
) +
labs(
title = "RMSE Profiles for Tuned Bagged Trees Model",
x = "Number of Bootstrap Trees (nbagg)",
y = "RMSE (Cross-Validation)",
color = "Min Split"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "right"
)
### Random Forest Model
library(randomForest)
library(doParallel)
library(ggplot2)
# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
set.seed(123)
# Train Random Forest model
rf_model <- train(
x = train_X, y = train_y,
method = "rf",
tuneLength = 10, # Automatically tests 10 different mtry values
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
)
# Generate predictions
rf_predictions <- predict(rf_model, test_X)
rf_rmse <- RMSE(rf_predictions, test_y)
rf_r2 <- cor(rf_predictions, test_y)^2
# Extract optimal parameters
optimal_params_rf <- rf_model$bestTune
# Compute AIC and BIC for Random Forest
compute_aic_bic_rf <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
k <- model$finalModel$ntree # Use the number of trees as a proxy for model complexity
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
rf_aic_bic <- compute_aic_bic_rf(rf_model, test_X, test_y)
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()
# Plot RMSE vs. mtry (Number of Features Tried at Each Split)
rf_results <- rf_model$results
ggplot(rf_results, aes(x = mtry, y = RMSE)) +
geom_line(color = "blue") +
geom_point(size = 2) +
labs(
title = "RMSE Profiles for the Random Forest Model",
x = "mtry (Number of Features Tried at Each Split)",
y = "RMSE (Cross-Validation)"
) +
theme_minimal()
# Print Results
cat("Random Forest Results:\n")
## Random Forest Results:
cat("Optimal Parameters (mtry):", optimal_params_rf$mtry, "\n")
## Optimal Parameters (mtry): 22
cat("Test RMSE:", rf_rmse, "\n")
## Test RMSE: 0.1000187
cat("Test R^2:", rf_r2, "\n")
## Test R^2: 0.701144
cat("AIC:", rf_aic_bic["AIC"], "\n")
## AIC: 93.57
cat("AICc:", rf_aic_bic["AICc"], "\n")
## AICc: 41843.57
cat("BIC:", rf_aic_bic["BIC"], "\n")
## BIC: 2213.708
### Boosting (GBM) Model
library(gbm)
library(doParallel)
library(ggplot2)
# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
set.seed(123)
# Train the GBM model
boost_model <- train(
x = train_X, y = train_y,
method = "gbm",
tuneLength = 10, # Automatically tunes over 10 combinations of parameters
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
verbose = FALSE # Suppress training output
)
# Generate predictions on the test set
boost_predictions <- predict(boost_model, test_X)
boost_rmse <- RMSE(boost_predictions, test_y)
boost_r2 <- cor(boost_predictions, test_y)^2
# Extract optimal parameters from the GBM model
optimal_params_boost <- boost_model$bestTune
# Compute AIC and BIC for Boosting (GBM)
compute_aic_bic_gbm <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
k <- model$finalModel$n.trees # Use number of trees as a proxy for model complexity
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
boost_aic_bic <- compute_aic_bic_gbm(boost_model, test_X, test_y)
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()
# Plot RMSE vs. Number of Trees, Shrinkage, and Depth
boost_results <- boost_model$results
ggplot(boost_results, aes(x = n.trees, y = RMSE, color = as.factor(interaction(shrinkage, interaction.depth)))) +
geom_line() +
geom_point(size = 2) +
labs(
title = "RMSE Profiles for the Boosting (GBM) Model",
x = "Number of Trees",
y = "RMSE (Cross-Validation)",
color = "Shrinkage/Depth"
) +
theme_minimal() +
theme(legend.position = "right")
# Display results
cat("Boosting (GBM) Results:\n")
## Boosting (GBM) Results:
cat("Optimal Parameters:\n")
## Optimal Parameters:
print(optimal_params_boost)
## n.trees interaction.depth shrinkage n.minobsinnode
## 99 450 10 0.1 10
cat("Test RMSE:", boost_rmse, "\n")
## Test RMSE: 0.1075398
cat("Test R^2:", boost_r2, "\n")
## Test R^2: 0.6380685
cat("AIC:", boost_aic_bic["AIC"], "\n")
## AIC: 67.9596
cat("AICc:", boost_aic_bic["AICc"], "\n")
## AICc: 6614.734
cat("BIC:", boost_aic_bic["BIC"], "\n")
## BIC: 1976.084
### XGBoost Search for Boosting Rounds
library(xgboost)
library(caret)
library(doParallel)
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Data preparation
data <- gradient_based_data
train_indices <- createDataPartition(data$PH, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
train_X <- train_data[, -which(names(train_data) == "PH")]
train_y <- train_data$PH
test_X <- test_data[, -which(names(test_data) == "PH")]
test_y <- test_data$PH
# Ensure train_X and test_X are data frames
if (is.list(train_X)) train_X <- as.data.frame(train_X)
if (is.list(test_X)) test_X <- as.data.frame(test_X)
# Set up parallel processing
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# Check for missing or infinite values
if (any(is.na(train_X)) || any(is.na(train_y)) ||
any(is.infinite(as.matrix(train_X))) || any(is.infinite(train_y))) {
stop("Training data contains missing or infinite values.")
}
# Refined search with fixed optimal max_depth and eta from earlier run
optimal_max_depth <- 9
optimal_eta <- 0.1
# Set up a granular grid for nrounds
refined_tune_grid <- expand.grid(
nrounds = seq(50, 200, by = 25), # More granular values for nrounds
max_depth = optimal_max_depth,
eta = optimal_eta,
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
# Train the model with the refined grid
xgb_refined_model <- tryCatch({
train(
x = train_X, y = train_y,
method = "xgbTree",
tuneGrid = refined_tune_grid,
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE, verboseIter = TRUE)
)
}, error = function(e) {
stop("Error during model training: ", e$message)
})
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 200, max_depth = 9, eta = 0.1, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.8 on full training set
# Check if the model trained successfully
if (!exists("xgb_refined_model")) stop("XGBoost training failed.")
# Predictions and evaluation
xgb_refined_predictions <- predict(xgb_refined_model, test_X)
xgb_refined_rmse <- RMSE(xgb_refined_predictions, test_y)
xgb_refined_r2 <- cor(xgb_refined_predictions, test_y)^2
# Compute AIC and BIC for XGBoost
compute_aic_bic_xgb <- function(model, test_X, test_y) {
predictions <- predict(model, newdata = test_X)
residuals <- test_y - predictions
n <- length(test_y) # Number of observations
k <- model$bestTune$nrounds # Number of boosting rounds as proxy for parameters
mse <- mean(residuals^2) # Mean squared error
loglik <- -n / 2 * log(2 * pi * mse) - n / 2 # Log-likelihood
aic <- -2 * loglik + 2 * k # AIC
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1) # Corrected AIC
bic <- -2 * loglik + log(n) * k # BIC
return(c(AIC = aic, AICc = aicc, BIC = bic))
}
xgb_aic_bic <- compute_aic_bic_xgb(xgb_refined_model, test_X, test_y)
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()
# Display refined results
cat("Refined XGBoost Results:\n")
## Refined XGBoost Results:
cat("Test RMSE:", xgb_refined_rmse, "\n")
## Test RMSE: 0.1099225
cat("Test R^2:", xgb_refined_r2, "\n")
## Test R^2: 0.6223548
cat("AIC:", xgb_aic_bic["AIC"], "\n")
## AIC: -409.5562
cat("AICc:", xgb_aic_bic["AICc"], "\n")
## AICc: -151.8639
cat("BIC:", xgb_aic_bic["BIC"], "\n")
## BIC: 438.499
# Plot RMSE vs. nrounds
results_df <- xgb_refined_model$results
if (nrow(results_df) > 0) {
ggplot(results_df, aes(x = nrounds, y = RMSE)) +
geom_line(color = "blue") +
geom_point(size = 2, color = "red") +
labs(
title = "RMSE Profiles for XGBoost (Refined Search)",
x = "Number of Boosting Rounds (nrounds)",
y = "RMSE (Cross-Validation)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
} else {
cat("No valid results available to plot.")
}
# Combine results into a data frame
results_df <- data.frame(
Model = c("Bagged Trees", "Random Forest", "GBM", "XGBoost"),
Test_RMSE = c(bagged_rmse, rf_rmse, boost_rmse, xgb_refined_rmse),
Test_R2 = c(bagged_r2, rf_r2, boost_r2, xgb_refined_r2),
AIC = c(bagged_aic_bic["AIC"], rf_aic_bic["AIC"], boost_aic_bic["AIC"], xgb_aic_bic["AIC"]),
BIC = c(bagged_aic_bic["BIC"], rf_aic_bic["BIC"], boost_aic_bic["BIC"], xgb_aic_bic["BIC"])
)
# Sort by Test_RMSE
results_df <- results_df[order(results_df$Test_RMSE), ]
# Display results
results_df %>%
kable(
caption = "Summary of Tree Models Performance"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE # Set to FALSE for custom widths
) %>%
column_spec(1, width = "5cm") %>% # Adjust the width of the first column
column_spec(2:5, width = "3cm") # Adjust the width of columns 2 to 5
Model | Test_RMSE | Test_R2 | AIC | BIC |
---|---|---|---|---|
Random Forest | 0.1000187 | 0.7011440 | 93.5700 | 2213.7079 |
GBM | 0.1075398 | 0.6380685 | 67.9596 | 1976.0837 |
XGBoost | 0.1099225 | 0.6223548 | -409.5562 | 438.4990 |
Bagged Trees | 0.1251824 | 0.5186901 | -630.8485 | -546.0429 |
Based on the Consolidated Model Performance Metrics, the Random Forest model is the best overall performer among the tree-based models. It achieves:
- Highest predictive accuracy: a Test R² of 0.7011 and a Test RMSE of 0.1000, indicating a superior ability to explain variability and to predict accurately on unseen data.
- Robustness: cross-validated and test-set performance are consistent, suggesting limited overfitting.
- Realistic predictions: the Random Forest model predicts PH values within the expected range, making it well suited for practical use.
Although Random Forest is the top choice, other tree-based models such as GBM, XGBoost, or Bagged Trees might be considered depending on the dataset or specific requirements.
While the Interaction Term model and SVR perform well on test metrics, they fail to predict realistic PH values. These unrealistic predictions make them unsuitable for practical use cases, where interpretability and alignment with expected ranges are critical.
By focusing solely on tree-based models, this selection ensures the emphasis remains on methods that consistently deliver realistic predictions while accommodating different modeling priorities.
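Since Random Forest is the recommended model, it is also worth checking which predictors drive its predictions; this supports interpretation without changing the selection above. A short sketch using caret's varImp on the rf_model trained earlier (the exact ranking will depend on the preprocessing applied to tree_based_data):
# Variable importance for the selected Random Forest model
rf_importance <- varImp(rf_model, scale = TRUE)
print(rf_importance)
# Plot the 15 most influential predictors
plot(rf_importance, top = 15, main = "Random Forest Variable Importance")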
# Combine results into a single data frame
all_models_performance <- data.frame(
Model_Type = c(
rep("OLS", 8), # OLS Models
rep("PLS", 5), # Partial Least Squares Models
rep("Non-Linear", 4), # Non-linear Models
rep("Tree-Based", 4) # Tree-Based Models
),
Model = c(
# OLS Models
"MLR (All Features Included)",
"Forward Selection",
"Backward Elimination",
"Stepwise Selection",
"Recursive Feature Elimination (RFE)",
"Interaction Terms",
"Polynomial Features",
"Feature Engineering",
# PLS Models
"PLS",
"PCR",
"Ridge Regression",
"Lasso Regression",
"Elastic Net",
# Non-linear Models
"KNN",
"MARS",
"SVR",
"ANN",
# Tree-Based Models
"Bagged Trees",
"Random Forest",
"GBM",
"XGBoost"
),
Train_R2 = c(
# OLS Models
summary(full_model)$r.squared,
forward_metrics$Train_R2,
backward_metrics$Train_R2,
stepwise_metrics$Train_R2,
rfe_metrics$Train_R2,
interaction_metrics$Train_R2,
polynomial_metrics$Train_R2,
fe_metrics$Train_R2,
# PLS Models
pls_train_r2,
pcr_train_r2,
ridge_train_r2,
lasso_train_r2,
enet_train_r2,
# Non-linear Models
knn_model$results$Rsquared[which.min(knn_model$results$RMSE)],
mars_model$results$Rsquared[which.min(mars_model$results$RMSE)],
svr_model$results$Rsquared[which.min(svr_model$results$RMSE)],
ann_model$results$Rsquared[which.min(ann_model$results$RMSE)],
# Tree-Based Models (test-set R² reused here; separate training R² values were not computed)
bagged_r2,
rf_r2,
boost_r2,
xgb_refined_r2
),
Test_R2 = c(
# OLS Models
evaluate_model(full_model, ols_train_data, ols_test_data)$Test_R2, # evaluate the OLS model on its own train/test split
forward_metrics$Test_R2,
backward_metrics$Test_R2,
stepwise_metrics$Test_R2,
rfe_metrics$Test_R2,
interaction_metrics$Test_R2,
polynomial_metrics$Test_R2,
fe_metrics$Test_R2,
# PLS Models
pls_test_r2,
pcr_test_r2,
ridge_test_r2,
lasso_test_r2,
enet_test_r2,
# Non-linear Models
knn_r2,
mars_r2,
svr_r2,
ann_r2,
# Tree-Based Models
bagged_r2,
rf_r2,
boost_r2,
xgb_refined_r2
),
Train_RMSE = c(
# OLS Models
evaluate_model(full_model, ols_train_data, ols_test_data)$Train_RMSE, # evaluate the OLS model on its own train/test split
forward_metrics$Train_RMSE,
backward_metrics$Train_RMSE,
stepwise_metrics$Train_RMSE,
rfe_metrics$Train_RMSE,
interaction_metrics$Train_RMSE,
polynomial_metrics$Train_RMSE,
fe_metrics$Train_RMSE,
# PLS Models
pls_train_rmse,
pcr_train_rmse,
ridge_train_rmse,
lasso_train_rmse,
enet_train_rmse,
# Non-linear Models
knn_model$results$RMSE[which.min(knn_model$results$RMSE)],
mars_model$results$RMSE[which.min(mars_model$results$RMSE)],
svr_model$results$RMSE[which.min(svr_model$results$RMSE)],
ann_model$results$RMSE[which.min(ann_model$results$RMSE)],
# Tree-Based Models (test-set RMSE reused here; separate training RMSE values were not computed)
bagged_rmse,
rf_rmse,
boost_rmse,
xgb_refined_rmse
),
Test_RMSE = c(
# OLS Models
evaluate_model(full_model, ols_train_data, ols_test_data)$Test_RMSE, # evaluate the OLS model on its own train/test split
forward_metrics$Test_RMSE,
backward_metrics$Test_RMSE,
stepwise_metrics$Test_RMSE,
rfe_metrics$Test_RMSE,
interaction_metrics$Test_RMSE,
polynomial_metrics$Test_RMSE,
fe_metrics$Test_RMSE,
# PLS Models
pls_test_rmse,
pcr_test_rmse,
ridge_test_rmse,
lasso_test_rmse,
enet_test_rmse,
# Non-linear Models
knn_rmse,
mars_rmse,
svr_rmse,
ann_rmse,
# Tree-Based Models
bagged_rmse,
rf_rmse,
boost_rmse,
xgb_refined_rmse
),
AIC = c(
rep(NA, 8), # OLS Models do not have AIC values
rep(NA, 5), # PLS Models do not have AIC values
knn_aic_bic["AIC"], mars_aic_bic["AIC"], svr_aic_bic["AIC"], ann_aic_bic["AIC"],
bagged_aic_bic["AIC"], rf_aic_bic["AIC"], boost_aic_bic["AIC"], xgb_aic_bic["AIC"]
),
BIC = c(
rep(NA, 8), # OLS Models do not have BIC values
rep(NA, 5), # PLS Models do not have BIC values
knn_aic_bic["BIC"], mars_aic_bic["BIC"], svr_aic_bic["BIC"], ann_aic_bic["BIC"],
bagged_aic_bic["BIC"], rf_aic_bic["BIC"], boost_aic_bic["BIC"], xgb_aic_bic["BIC"]
)
)
# Sort the data frame by Test_RMSE
all_models_performance <- all_models_performance %>% arrange(Test_RMSE)
# Display the consolidated DataFrame
all_models_performance %>%
kable(
caption = "Consolidated Model Performance Metrics"
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
) %>%
column_spec(1, width = "10cm") %>%
column_spec(2:7, width = "6cm")
Model_Type | Model | Train_R2 | Test_R2 | Train_RMSE | Test_RMSE | AIC | BIC |
---|---|---|---|---|---|---|---|
Tree-Based | Random Forest | 0.7011440 | 0.7011440 | 0.1000187 | 0.1000187 | 93.5700 | 2213.7079 |
Tree-Based | GBM | 0.6380685 | 0.6380685 | 0.1075398 | 0.1075398 | 67.9596 | 1976.0837 |
Tree-Based | XGBoost | 0.6223548 | 0.6223548 | 0.1099225 | 0.1099225 | -409.5562 | 438.4990 |
Non-Linear | SVR | 0.5242925 | 0.5381500 | 0.1184443 | 0.1216659 | 2794.5859 | 10215.0686 |
Tree-Based | Bagged Trees | 0.5186901 | 0.5186901 | 0.1251824 | 0.1251824 | -630.8485 | -546.0429 |
Non-Linear | ANN | 0.4810389 | 0.4972056 | 0.1248814 | 0.1276438 | -186.2019 | 810.2630 |
Non-Linear | KNN | 0.4680361 | 0.4892592 | 0.1258932 | 0.1286795 | -647.9108 | -647.9108 |
OLS | Interaction Terms | 0.6187726 | 0.4706976 | 0.1054870 | 0.1300389 | NA | NA |
Non-Linear | MARS | 0.4347175 | 0.4500877 | 0.1288138 | 0.1327350 | -584.0745 | -516.2301 |
PLS | Lasso Regression | 0.4015648 | 0.3683422 | 0.1321666 | 0.1421016 | NA | NA |
PLS | Ridge Regression | 0.4003721 | 0.3695003 | 0.1323733 | 0.1421150 | NA | NA |
OLS | Feature Engineering | 0.4023797 | 0.3678177 | 0.1320747 | 0.1421158 | NA | NA |
OLS | Recursive Feature Elimination (RFE) | 0.4016072 | 0.3677272 | 0.1321600 | 0.1421260 | NA | NA |
OLS | Backward Elimination | 0.4005068 | 0.3659733 | 0.1322815 | 0.1423230 | NA | NA |
OLS | Forward Selection | 0.4005068 | 0.3659733 | 0.1322815 | 0.1423230 | NA | NA |
OLS | Stepwise Selection | 0.4005068 | 0.3659733 | 0.1322815 | 0.1423230 | NA | NA |
PLS | PCR | 0.3889516 | 0.3659666 | 0.1335502 | 0.1423654 | NA | NA |
OLS | Polynomial Features | 0.4054491 | 0.3635935 | 0.1317351 | 0.1425899 | NA | NA |
PLS | PLS | 0.3864426 | 0.3606606 | 0.1338241 | 0.1429587 | NA | NA |
OLS | MLR (All Features Included) | 0.4016072 | -82.0321851 | 1.6236157 | 1.6287129 | NA | NA |
PLS | Elastic Net | 0.0442056 | 0.0235254 | 194.6793992 | 194.0983447 | NA | NA |
The remaining task is to predict the target variable (PH) on a new evaluation dataset that lacks the ground truth for comparison. Below are the code blocks to make predictions on the evaluation_set_boxcox dataset using the Random Forest (best model) and alternative models (Interaction Terms OLS, SVR, and XGBoost). Necessary data preprocessing steps for each model are included, along with forecasting and evaluation plots.
# Load required libraries
library(randomForest)
library(caret)
library(e1071)
library(xgboost)
library(dplyr)
library(ggplot2)
library(Metrics)
# Load the evaluation dataset
evaluation_data <- evaluation_set_transformed
# Imputed dataset for evaluation
evaluation_data_imputed <- evaluation_set_imputed
# Random Forest Predictions
set.seed(123)
rf_predictions <- predict(rf_model, newdata = tree_based_evaluation_data)
# Add predictions to evaluation dataset
tree_based_evaluation_data$PH_predicted_rf <- rf_predictions
# Reorder columns to place PH_predicted_rf next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
col_order <- c("PH", "PH_predicted_rf", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_rf")))
tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}
# Save results to a CSV file
write.csv(tree_based_evaluation_data, "rf_evaluation_results_with_predictions.csv", row.names = FALSE)
# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_rf)) +
geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
labs(
title = "Predicted PH Distribution (Random Forest)",
x = "Predicted PH",
y = "Frequency"
) +
theme_minimal()
# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame': 267 obs. of 35 variables:
## $ PH : logi NA NA NA NA NA NA ...
## $ PH_predicted_rf : num 8.63 8.63 8.61 8.62 8.58 ...
## $ Brand.Code : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.278 0.232 0.313 0.19 0.163 ...
## $ Carb.Pressure : num 0.471 0.471 0.471 0.471 0.471 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.2144 0.1509 0.0832 0.1414 0.1834 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -115 -115 -115 -115 -115 ...
## $ Carb.Pressure1 : num 9.74 9.81 9.85 9.99 9.69 ...
## $ Fill.Pressure : num 9.73 9.76 9.71 9.05 10.31 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 3.68 3.77 3.69 3.88 3.66 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : num 4624 4567.4 4801 32.9 5065.5 ...
## $ Density : num 0.671 1.001 0.683 0.584 0.671 ...
## $ MFR : num 728 736 735 313 752 ...
## $ Balling : num 0.533 0.658 0.54 0.476 0.533 ...
## $ Pressure.Vacuum : num -48.2 -70.6 -62.5 -55 -55 ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 0.452 0.452 0.452 0.452 0.452 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 0.942 1.477 0.933 0.942 0.933 ...
## $ Brand_Code : num 4 1 2 2 2 2 1 2 1 4 ...
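A practical caveat for randomForest predictions: factor predictors in the new data must carry exactly the levels seen at training time, otherwise predict() fails. The short defensive check below is a sketch; tree_based_train_data is an assumed, illustrative name for the training frame used to fit rf_model.
# Defensive sketch: align Brand.Code levels with those used to train rf_model
# (`tree_based_train_data` is an assumed name for the training frame)
train_levels <- levels(tree_based_train_data$Brand.Code)
stopifnot(all(levels(tree_based_evaluation_data$Brand.Code) %in% train_levels))
tree_based_evaluation_data$Brand.Code <- factor(
  tree_based_evaluation_data$Brand.Code,
  levels = train_levels
)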
# Load necessary libraries
library(ggplot2)
library(caret)
library(gbm)
# Set seed for reproducibility
set.seed(123)
# Generate predictions on the evaluation dataset
gbm_predictions <- predict(boost_model, newdata = tree_based_evaluation_data)
# Add predictions to evaluation dataset
tree_based_evaluation_data$PH_predicted_gbm <- gbm_predictions
# Reorder columns to place PH_predicted_gbm next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
col_order <- c("PH", "PH_predicted_gbm", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_gbm")))
tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}
# Save results to a CSV file
write.csv(tree_based_evaluation_data, "gbm_evaluation_results_with_predictions.csv", row.names = FALSE)
# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_gbm)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
  labs(
    title = "Predicted PH Distribution (GBM)",
    x = "Predicted PH",
    y = "Frequency"
  ) +
  theme_minimal()
# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame': 267 obs. of 36 variables:
## $ PH : logi NA NA NA NA NA NA ...
## $ PH_predicted_gbm : num 8.72 8.73 8.75 8.79 8.68 ...
## $ PH_predicted_rf : num 8.63 8.63 8.61 8.62 8.58 ...
## $ Brand.Code : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.278 0.232 0.313 0.19 0.163 ...
## $ Carb.Pressure : num 0.471 0.471 0.471 0.471 0.471 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.2144 0.1509 0.0832 0.1414 0.1834 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -115 -115 -115 -115 -115 ...
## $ Carb.Pressure1 : num 9.74 9.81 9.85 9.99 9.69 ...
## $ Fill.Pressure : num 9.73 9.76 9.71 9.05 10.31 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 3.68 3.77 3.69 3.88 3.66 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : num 4624 4567.4 4801 32.9 5065.5 ...
## $ Density : num 0.671 1.001 0.683 0.584 0.671 ...
## $ MFR : num 728 736 735 313 752 ...
## $ Balling : num 0.533 0.658 0.54 0.476 0.533 ...
## $ Pressure.Vacuum : num -48.2 -70.6 -62.5 -55 -55 ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 0.452 0.452 0.452 0.452 0.452 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 0.942 1.477 0.933 0.942 0.933 ...
## $ Brand_Code : num 4 1 2 2 2 2 1 2 1 4 ...
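One note on the GBM call above: if boost_model is a caret train object, predict() automatically uses the tuned number of trees, but if it is a raw gbm fit, predict.gbm expects n.trees to be supplied. The snippet below sketches that second case under the assumption that boost_model is a gbm object, which carries an $n.trees component.
# Sketch for the case where boost_model is a raw gbm::gbm fit: predict.gbm
# needs the number of trees to be stated explicitly
if (inherits(boost_model, "gbm")) {
  gbm_predictions <- predict(
    boost_model,
    newdata = tree_based_evaluation_data,
    n.trees = boost_model$n.trees  # use all fitted trees
  )
}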
# Predictions for XGBoost
set.seed(123)
# Generate predictions on the evaluation dataset
xgb_predictions <- predict(xgb_refined_model, newdata = gradient_models_evaluation_data_final)
# Add predictions to evaluation dataset
tree_based_evaluation_data$PH_predicted_xgb <- xgb_predictions
# Reorder columns to place PH_predicted_xgb next to PH
if ("PH" %in% colnames(tree_based_evaluation_data)) {
col_order <- c("PH", "PH_predicted_xgb", setdiff(names(tree_based_evaluation_data), c("PH", "PH_predicted_xgb")))
tree_based_evaluation_data <- tree_based_evaluation_data[, col_order]
}
# Save results to a CSV file
write.csv(tree_based_evaluation_data, "xgb_evaluation_results_with_predictions.csv", row.names = FALSE)
# Plot Predicted PH Distribution
ggplot(tree_based_evaluation_data, aes(x = PH_predicted_xgb)) +
  geom_histogram(binwidth = 0.1, fill = "blue", alpha = 0.7) +
  labs(
    title = "Predicted PH Distribution (XGBoost)",
    x = "Predicted PH",
    y = "Frequency"
  ) +
  theme_minimal()
# Display the updated dataset
str(tree_based_evaluation_data)
## 'data.frame': 267 obs. of 37 variables:
## $ PH : logi NA NA NA NA NA NA ...
## $ PH_predicted_xgb : num 8.56 8.59 8.57 8.59 8.55 ...
## $ PH_predicted_gbm : num 8.72 8.73 8.75 8.79 8.68 ...
## $ PH_predicted_rf : num 8.63 8.63 8.61 8.62 8.58 ...
## $ Brand.Code : Factor w/ 4 levels "A","B","C","D": 4 1 2 2 2 2 1 2 1 4 ...
## $ Carb.Volume : num 5.48 5.39 5.29 5.27 5.41 ...
## $ Fill.Ounces : num 24 24 23.9 23.9 24.2 ...
## $ PC.Volume : num 0.278 0.232 0.313 0.19 0.163 ...
## $ Carb.Pressure : num 0.471 0.471 0.471 0.471 0.471 ...
## $ Carb.Temp : num 135 135 140 139 142 ...
## $ PSC : num 0.236 0.042 0.068 0.004 0.04 0.078 0.088 0.076 0.246 0.146 ...
## $ PSC.Fill : num 0.2144 0.1509 0.0832 0.1414 0.1834 ...
## $ PSC.CO2 : num 0.04 0.08 0.02 0.02 0.06 0.02 0 0.04 0.04 0.02 ...
## $ Mnf.Flow : num -115 -115 -115 -115 -115 ...
## $ Carb.Pressure1 : num 9.74 9.81 9.85 9.99 9.69 ...
## $ Fill.Pressure : num 9.73 9.76 9.71 9.05 10.31 ...
## $ Hyd.Pressure1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyd.Pressure4 : num 3.68 3.77 3.69 3.88 3.66 ...
## $ Filler.Level : num 129 120 119 120 116 ...
## $ Filler.Speed : int 3986 4012 4010 1016 4018 4010 4010 3986 4010 1006 ...
## $ Temperature : num 66 65.6 65.6 74.4 66.4 66.6 66.8 70.6 65.8 66 ...
## $ Usage.cont : num 21.7 17.6 24.2 18.1 21.3 ...
## $ Carb.Flow : num 4624 4567.4 4801 32.9 5065.5 ...
## $ Density : num 0.671 1.001 0.683 0.584 0.671 ...
## $ MFR : num 728 736 735 313 752 ...
## $ Balling : num 0.533 0.658 0.54 0.476 0.533 ...
## $ Pressure.Vacuum : num -48.2 -70.6 -62.5 -55 -55 ...
## $ Oxygen.Filler : num 0.022 0.03 0.046 0.086 0.082 0.064 0.042 0.096 0.046 0.096 ...
## $ Bowl.Setpoint : int 130 120 120 120 120 120 120 120 120 120 ...
## $ Pressure.Setpoint: num 0.452 0.452 0.452 0.452 0.452 ...
## $ Air.Pressurer : num 143 147 147 146 146 ...
## $ Alch.Rel : num 6.56 7.14 6.52 6.48 6.5 6.5 7.18 7.16 7.14 7.78 ...
## $ Carb.Rel : num 5.34 5.58 5.34 5.5 5.38 5.42 5.46 5.42 5.44 5.52 ...
## $ Balling.Lvl : num 0.942 1.477 0.933 0.942 0.933 ...
## $ Brand_Code : num 4 1 2 2 2 2 1 2 1 4 ...
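A similar note applies to XGBoost: a caret train wrapper accepts a data frame in predict(), whereas a raw xgb.Booster expects a numeric matrix whose columns match the training features. The sketch below covers the raw-booster case; gradient_feature_names is an assumed name for the vector of feature names used when the booster was trained.
# Sketch for a raw xgb.Booster: build a numeric matrix whose columns match the
# training features (`gradient_feature_names` is an assumed/illustrative name)
library(xgboost)
if (inherits(xgb_refined_model, "xgb.Booster")) {
  eval_matrix <- as.matrix(
    gradient_models_evaluation_data_final[, gradient_feature_names]
  )
  xgb_predictions <- predict(xgb_refined_model, newdata = eval_matrix)
}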
The pairwise scatterplots below compare the predicted PH values from the three models: Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost (XGB). The three sets of predictions fall in a similar range and track one another closely, which supports the consistency of the models, while the minor differences between them could be explored further depending on application requirements.
# Combine predictions into a single data frame
predictions_df <- tree_based_evaluation_data %>%
  select(PH_predicted_rf, PH_predicted_gbm, PH_predicted_xgb)
# Display summary of predicted PH values
summary(predictions_df)
## PH_predicted_rf PH_predicted_gbm PH_predicted_xgb
## Min. :8.338 Min. :8.320 Min. :8.192
## 1st Qu.:8.484 1st Qu.:8.565 1st Qu.:8.418
## Median :8.537 Median :8.636 Median :8.507
## Mean :8.546 Mean :8.632 Mean :8.497
## 3rd Qu.:8.609 3rd Qu.:8.702 3rd Qu.:8.588
## Max. :8.744 Max. :8.875 Max. :8.701
# Pairwise scatterplots to visualize consistency between models
# (pairs() recycles bg over points, not panels, so a single fill color is used)
pairs(predictions_df,
      main = "Predicted PH Values: Model Comparisons",
      pch  = 21,
      bg   = "steelblue")