OASBUD Lesion Classification.

Author

My (Mimi) Tran

I. Data Wrangling and EDA

Libraries and loading data

# Library
library(tidyverse)
library(tidymodels)
library(dplyr)
library(recipes)

# Load data
master <- read_csv("Master_Feature_Table.csv")
head(master)
# A tibble: 6 × 25
  Lesion ID    Scan  Class BIRADS Mu_Intra Omega_Intra Mu_Peri Omega_Peri
   <dbl> <chr> <chr> <dbl> <chr>     <dbl>       <dbl>   <dbl>      <dbl>
1      1 1mm   rf1       0 3         0.556     477917.   0.373   4652261.
2      1 1mm   rf2       0 3         0.779     215309.   0.331   4884297.
3      2 2mm   rf1       0 3         0.652     256974.   0.341   8680795.
4      2 2mm   rf2       0 3         0.725     165450.   0.448   3775708.
5      3 3als  rf1       0 4a        0.731     161513.   0.379   2919763.
6      3 3als  rf2       0 4a        0.588     289491.   0.383   7814134.
# ℹ 16 more variables: Alpha_Intra <dbl>, Theta_Intra <dbl>, Alpha_Peri <dbl>,
#   Theta_Peri <dbl>, X_Intra <dbl>, U_Intra <dbl>, X_Peri <dbl>, U_Peri <dbl>,
#   Contrast_Intra <dbl>, Contrast_Peri <dbl>, Correlation_Intra <dbl>,
#   Correlation_Peri <dbl>, Energy_Intra <dbl>, Energy_Peri <dbl>,
#   Homogeneity_Intra <dbl>, Homogeneity_Peri <dbl>
#View(master)

Correct data type: `Class` , `ID` and `BIRADS` columns are currently numbers or characters, but for ML in R, they need to be factors(categorical variables).

 master <- master %>% mutate(
   ID = as.factor(ID),
   Class = as.factor(Class),
   BIRADS = as.factor(BIRADS)
 )
 str(master[, c("ID", "BIRADS", "Class")])
tibble [200 × 3] (S3: tbl_df/tbl/data.frame)
 $ ID    : Factor w/ 99 levels "102ks","10jw",..: 37 37 54 54 61 61 70 70 80 80 ...
 $ BIRADS: Factor w/ 5 levels "3","4a","4b",..: 1 1 1 1 2 2 5 5 3 3 ...
 $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...

Exploratory Data Analysis:

  1. Class check.

    First, we need to know how many Benign (0) vs. Malignant (1) cases we have, and how they align with the BIRADS scores.

# count of class 0 and 1
master_count <- master %>% count(Class)
master_count
# A tibble: 2 × 2
  Class     n
  <fct> <int>
1 0        96
2 1       104
# see how does class relate to birads
master_count.birad <- master%>% count(BIRADS, Class)
master_count.birad
# A tibble: 8 × 3
  BIRADS Class     n
  <fct>  <fct> <int>
1 3      0        62
2 4a     0        18
3 4a     1         4
4 4b     0        14
5 4b     1        18
6 4c     1        34
7 5      0         2
8 5      1        48

There are 96 Benign case and 102 Malignant cases. Most of the benign cases ( Class 0) align with lower `BIRADS` and malignant cases (Class 1) align with higher `BIRADS` . However, there will be some challenge in BIRADS 4a and 4b. Most of 4a are benign but some are malignant and most of 4b are malignant but some are benign. This is a good place to use Kolmogorov-Smirnov Test to see whether or not their distributions (4a vs 4b) are distinct.

The anomaly here is there are two cases of benign that have BIRADS 5 ~ this is false positive. We can use Proportion of Rejections test to evaluate this further.

  1. Next, we want to check normality for all the features for each class.
master_long_classed <- master %>%
  select(Class, where(is.numeric)) %>%
  select(-any_of(c("ID", "Lesion", "Scan"))) %>% 
  pivot_longer(cols = -Class, names_to = "feature", values_to = "value")

ggplot(master_long_classed, aes(x = value, fill = factor(Class))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~feature, scales = "free") +
  scale_fill_manual(values = c("0" = "green", "1" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Looking at density plots above, we see that distributions that exhibit strong seperations between the Benign (Green) and Malignant (Red) classes are: distributions: U_Intra, U_Peri, X_Intra, X_Peri, Mu_Intra, Alpha_Intra. This visual gap provides the groundwork for the Kolmogorov-Smirnov (KS) test.

II. Test Statistics

1. Two-Sample Kolmogorov-Smirnov (KS) Test

Before we head to Machine Learning models, we will use the Two-Sample Kolmogorov-Smirnov (KS) Test.

Suppose we have two samples size \(m\) and size \(n\)The Two-Sample Kolmogorov-Smirnov (KS) Test use to test whether or not two samples came from the same distribution. The null hypothesis is \(H_0\): both samples (Class 0 and Class 1) come from a population with the same distribution. The test statistics is D - statistics which has values between 0 and 1. The higher the D the more separated “Green” and “Red” curves are. And ultimately if the p-value < 0.05, we reject \(H_0\) and say there is a statistically significant difference between Benign and Malignant for that feature.

# Create a summary table of KS results for all numeric features
ks_summary <- master %>%
  select(where(is.numeric)) %>%
  select(-any_of(c("Class", "ID", "Scan", "BIRADS", "Lesion"))) %>%
  map_df(~{
    test <- ks.test(.x[master$Class == 0], .x[master$Class == 1])
    data.frame(D_statistic = test$statistic, p_value = test$p.value)
  }, .id = "Feature") %>%
  arrange(desc(D_statistic)) # put the most significant at the top 

print(ks_summary)
                 Feature D_statistic      p_value
D...1         Omega_Peri  0.45753205 5.694600e-10
D...2         Theta_Peri  0.41586538 2.850998e-08
D...3             X_Peri  0.40705128 6.179996e-08
D...4             U_Peri  0.39983974 1.147959e-07
D...5            X_Intra  0.33493590 1.753000e-05
D...6            U_Intra  0.33493590 1.753000e-05
D...7        Omega_Intra  0.30448718 1.335927e-04
D...8        Theta_Intra  0.26362179 1.482570e-03
D...9        Alpha_Intra  0.20112179 2.957408e-02
D...10 Correlation_Intra  0.19471154 3.838945e-02
D...11      Energy_Intra  0.16746795 1.057806e-01
D...12  Correlation_Peri  0.15224359 1.744454e-01
D...13          Mu_Intra  0.14102564 2.444385e-01
D...14           Mu_Peri  0.14022436 2.501625e-01
D...15       Energy_Peri  0.13782051 2.679070e-01
D...16        Alpha_Peri  0.11778846 4.503562e-01
D...17  Homogeneity_Peri  0.10897436 5.481896e-01
D...18 Homogeneity_Intra  0.08173077 8.578663e-01
D...19     Contrast_Peri  0.06009615 9.868820e-01
D...20    Contrast_Intra  0.05929487 9.885425e-01

For those features that are statistically significant, we identify them as the most viable biomarkers for the classification task.

2. Class 0 Reference Distributions and Evaluation of Malignant Rejection Rates

The procedure:

  • For each feature (Nakagami, Gamma, etc.), we take all Benign scans and turn their parameters into curves (plug the parameters into their distribution’s pdf). Average those curves to create one single distribution.

  • Once we have the single distribution data we decide where “normal” ends and “non-normal” begins. And how do we do this? we marks the outer 5% of the curve as rejection region. If a lesion falls within this rejection region, it is statistically rejected because it doesn’t look like typical benign lesion.

  • Next, we test all 104 malignant observations. We look at where their values land on the x-axis relative to single distribution data.

    If a malignant lands on that 5% rejection region, the test is a success - it “found” the cancer.

    If it lands in the normal region, the test “missed” it.

  • Finally, we calculate the proportion of rejection. We want to know: What proportion of the malignant cases did the single distribution correctly find as being not benign.

This test will let us know which distribution (Nakagami, Gamma, and HK) is the better tool for this analysis.

Let’s start with Nakagami PDF.

Nakagami PDF:

\(X \sim Nakagami(\mu, \Omega)\)

\[ f(x)=\frac{2m^m}{\Gamma(m)\Omega^m}x^{2m-1}\exp\bigg(-\frac{mx^2}{\Omega}\bigg) \]

#define Nakagami PDF:
nakagami_pdf <- function(x, mu, omega){
  term1 <- (2*mu^mu)/(gamma(mu)*omega^mu)
  term2 <- x^(2*mu-1)
  term3<-exp(-(mu/omega)*x^2)
  return(term1*term2*term3)
}

# filter class 0
class0_data <- master %>% filter(Class==0) 

#head(class0_dat)

#Nakagami table:
# Since both mu features are not statistcally significant, we use the average
avg_mu_intra_0 <- mean(class0_data$Mu_Intra)

avg_mu_peri_0 <- mean(class0_data$Mu_Peri)

class0_nakagami <- class0_data %>% 
  select( M= Mu_Peri, Omega = Omega_Peri)

# Table for INTRA region
class0_nakagami_intra <- class0_data %>%
  select(Omega = Omega_Intra) %>% 
  mutate(Mu = avg_mu_intra_0)

# Table for PERI region
class0_nakagami_peri <- class0_data %>% 
  select(Omega = Omega_Peri) %>% 
  mutate(Mu = avg_mu_peri_0)

 

# Define x-axis: (for pixel: 0=black, to 255 = white)
# checking range of class0_nakagami_peri and class0_nakagami_intra values to get idea of x_axis range
summary(class0_nakagami_peri$Mu)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3908  0.3908  0.3908  0.3908  0.3908  0.3908 
summary(class0_nakagami_peri$Omega)# max is roughly 45 millions, Omega is sqrt(intensity) which is 6764.487
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  921804  3537296  5716001  8075811  9036075 45758278 
max_intensity <- 7000
x_axis <- seq(1, max_intensity, length.out=2000)

#Peri
peri_pdf <- sapply(1:nrow(class0_nakagami_peri), function(i) {
  nakagami_pdf(x_axis, 
               mu = class0_nakagami_peri$Mu[i], 
               omega = class0_nakagami_peri$Omega[i])
})

#averaging to get a single per
single_peri_pdf <- rowMeans(peri_pdf)
#head(single_peri_pdf)

# Intra
intra_pdf <- sapply(1:nrow(class0_nakagami_intra), function(i) {
  nakagami_pdf(x_axis, 
               mu = class0_nakagami_intra$Mu[i], 
               omega = class0_nakagami_intra$Omega[i])
})
single_intra_pdf <- rowMeans(intra_pdf)
#head(single_intra_pdf)

# plot to see
plot(x_axis, single_peri_pdf, type="l", col="blue", lwd=3,
     main = "Benign Template",
     xlab = "Pixel Intensity (Amplitude)",
     ylab = "Probability")
lines(x_axis, single_intra_pdf, col="red", lwd=3)

The blue curve is the peri tumoral region (tumor boundary) and the red curve represents the intra tumoral region (tumor center). From the gragh we see that the blue curve spikes around 0 (\(\mu <0.5\)). This characterizes Pre-Rayleig scattering - indicating regions with low scatterer density (dark) and irregular area at the tumor edge. The red curve peaks away from 0 (\(\mu >0.5\)), maybe near the value 500, indicates characteristic of Rayleigh or Post-Rayleigh scattering. This suggests the center of benign tumors is brighter and contains more regular patterns compared to the edges.

Thus the strong visual separation between Peri and Intra templates confirm that statistical “characters” of tissue varies by region. Therefore, these distinct Class 0 provides a well baseline for calculating the rejection rates of Class 1

#Find the 95% rejection threshold
# Function to calculate where the area under the curve hits 95%
get_cutoff <- function(x_vals, pdf_curve) {
  step_size <- x_vals[2] - x_vals[1]
  cumulative_area <- cumsum(pdf_curve * step_size)
  
  # normalize to ensure max is exactly 1.0
  cumulative_area <- cumulative_area / max(cumulative_area)
  
  # find the first X value where the area passes 0.95
  cutoff_index <- which(cumulative_area >= 0.95)[1]
  
  return(x_vals[cutoff_index])
}

# Calculate cutoffs for your generated templates
cutoff_peri <- get_cutoff(x_axis, single_peri_pdf)
cutoff_intra <- get_cutoff(x_axis, single_intra_pdf)

cat("Benign Cutoff Threshold (Peri):", round(cutoff_peri, 2), "\n")
Benign Cutoff Threshold (Peri): 4993.78 
cat("Benign Cutoff Threshold (Intra):", round(cutoff_intra, 2), "\n")
Benign Cutoff Threshold (Intra): 1338.48 
# Test the malignant cases (Class 1)

class1_data <- master %>% filter(Class == 1)

#Convert Omega (Power) to Amplitude
# our data is Omega (Intensity^2).
# we must take the Square Root of Omega to compare apples to apples.
malig_scores_peri <- sqrt(class1_data$Omega_Peri)
malig_scores_intra <- sqrt(class1_data$Omega_Intra)

# Check which malignant cases fall past the benign cutoff
# TRUE = Rejected (detected as anomaly), FALSE = Accepted (missed)
rejected_peri <- malig_scores_peri > cutoff_peri
rejected_intra <- malig_scores_intra > cutoff_intra

# Calculate the percentage
rate_peri <- mean(rejected_peri) * 100
rate_intra <- mean(rejected_intra) * 100


cat("FINAL NAKAGAMI REJECTION RATES:\n")
FINAL NAKAGAMI REJECTION RATES:
cat("Peri-Tumoral Detection Rate:", round(rate_peri, 2), "%\n")
Peri-Tumoral Detection Rate: 0.96 %
cat("Intra-Tumoral Detection Rate:", round(rate_intra, 2), "%\n")
Intra-Tumoral Detection Rate: 2.88 %
# visualize the cutoff lines
library(ggplot2)

plot_data <- data.frame(
  Amplitude = c(x_axis, x_axis),
  Density   = c(single_peri_pdf, single_intra_pdf),
  Group     = rep(c("Peri Template", "Intra Template"), each = length(x_axis))
)

ggplot(plot_data, aes(x = Amplitude, y = Density, fill = Group, color = Group)) +
  geom_area(alpha = 0.45, position = "identity") +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = cutoff_peri, color = "blue", linetype = 2, linewidth = 1) +
  geom_vline(xintercept = cutoff_intra, color = "red", linetype = 2, linewidth = 1) +
  labs(
    title = "Nakagami Parameter-Based Rejection Test",
    x = "Pixel Intensity (Amplitude)",
    y = "Probability"
  ) +
  scale_color_discrete(name = NULL) +
  scale_fill_discrete(name = NULL) +
  theme_minimal()

From the rejection result, Nakagami model failed to distinguish between Benign and Malignant tumor: Peri Detection Rate: 0.96%, meaning uut of roughly 100 cancer cases, this test only caught 1. Intra Detection Rate: 2.88% - it caught roughly 3 out of 100 cancer cases. After all, both blue curve and red curve have wide long right tail - meaning benign and malignant tumors can be anything from 0 to a few thousands amplitude, provided poor discrimination between benign and malignant lesions.

Since the simple Nakagami model failed to catch the cancer, we can move on to Gamma and HK to see if they can capture the “clustering” of cancer cells better.

Gamma PDF:

\(X \sim Gamma(\alpha, \lambda)\)

\[ f(x)=\dfrac{x^{\alpha-1}e^{-\lambda x}\lambda^\alpha}{\Gamma(\alpha)}, \;\text{for}; x>0, \alpha, \lambda >0 \]

# GAMMA
# GAMMA
library(tidyverse)
library(ggplot2)

class0_data <- master %>% filter(Class == 0)

max_power <- 46000000
x_axis_gamma <- seq(2000, max_power, length.out = 10000)

# generate peri curve
gamma_matrix_peri <- sapply(1:nrow(class0_data), function(i) {
  dgamma(
    x_axis_gamma,
    shape = class0_data$Alpha_Peri[i],
    scale = class0_data$Theta_Peri[i]
  )
})
single_gamma_peri <- rowMeans(gamma_matrix_peri)

# generate intra curve
gamma_matrix_intra <- sapply(1:nrow(class0_data), function(i) {
  dgamma(
    x_axis_gamma,
    shape = class0_data$Alpha_Intra[i],
    scale = class0_data$Theta_Intra[i]
  )
})
single_gamma_intra <- rowMeans(gamma_matrix_intra)

# Calculate rejection cutoff
get_cutoff <- function(x_vals, pdf_curve) {
  step_size <- x_vals[2] - x_vals[1]
  cumulative_area <- cumsum(pdf_curve * step_size)
  cumulative_area <- cumulative_area / max(cumulative_area)
  cutoff_index <- which(cumulative_area >= 0.95)[1]
  return(x_vals[cutoff_index])
}

cutoff_gamma_peri <- get_cutoff(x_axis_gamma, single_gamma_peri)
cutoff_gamma_intra <- get_cutoff(x_axis_gamma, single_gamma_intra)

cat("Gamma Cutoff (Peri):", round(cutoff_gamma_peri, 2), "\n")
Gamma Cutoff (Peri): 6600.26 
cat("Gamma Cutoff (Intra):", round(cutoff_gamma_intra, 2), "\n")
Gamma Cutoff (Intra): 2000 
# TEST MALIGNANT CASES
class1_data <- master %>% filter(Class == 1)

malig_intensity_peri <- class1_data$Alpha_Peri * class1_data$Theta_Peri
malig_intensity_intra <- class1_data$Alpha_Intra * class1_data$Theta_Intra

reject_gamma_peri <- malig_intensity_peri > cutoff_gamma_peri
reject_gamma_intra <- malig_intensity_intra > cutoff_gamma_intra

rate_gamma_peri <- mean(reject_gamma_peri) * 100
rate_gamma_intra <- mean(reject_gamma_intra) * 100

cat("FINAL GAMMA REJECTION RATES:\n")
FINAL GAMMA REJECTION RATES:
cat("Peri =", round(rate_gamma_peri, 2), "%\n")
Peri = 0 %
cat("Intra =", round(rate_gamma_intra, 2), "%\n")
Intra = 0.96 %
# HK-style shaded plot
plot_data_gamma <- data.frame(
  Power = c(x_axis_gamma, x_axis_gamma),
  Density = c(single_gamma_peri, single_gamma_intra),
  Group = rep(c("Peri (Edge)", "Intra (Center)"), each = length(x_axis_gamma))
)

ggplot(plot_data_gamma, aes(x = Power, y = Density, fill = Group, color = Group)) +
  geom_area(alpha = 0.45, position = "identity") +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = cutoff_gamma_peri, color = "blue", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = cutoff_gamma_intra, color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Gamma Parameter-Based Rejection Analysis",
    subtitle = paste0(
      "Peri Rejection Proportion: ", round(rate_gamma_peri, 2),
      "% | Intra Rejection Proportion: ", round(rate_gamma_intra, 2), "%"
    ),
    x = "Pixel Intensity (Power)",
    y = "Probability Density"
  ) +
  scale_fill_discrete(name = NULL) +
  scale_color_discrete(name = NULL) +
  theme_minimal()

Numerical result explains “vertical line” on the left of the plot - benign data hurtles so close to 0 (darkness) on the scale of 46 millions.

Peri tumoral detection is 0%. Model fails to tell them apart.

Intra tumoral detection is 0.96%. Same thing: model fails to tell them apart.

So far, we tested Amplitude (Nakagami) and found <1% detection rate. We tested Intensity (Gamma) and found 0% detection rate. This implies cancer can not be detected using brihghtness alone. Next we use Homodyned - K distribution which use clusteral structuring

Homodyded - K

Unlike the Gamma or Nakagami models, the HK distribution does not have a simple “closed-form” equation. Instead, it is defined by an integral of Bessel functions that must be solved numerically.

\[ P(x)=\int_0^\infty u.x.J_0(ux). J_0(u\epsilon).\Big(1+\dfrac{u^2\sigma^2}{2 \alpha}\Big)^{-\alpha} du \] where:

  • \(J_0\) is the Bessel function of first order.

  • \(\epsilon\) is the coherent signal (how steady the structure is)

  • \(x\): signal amplitude (this is what we measure)

  • \(\sigma^2\): diffused signal (background scattering signal)

  • \(\alpha\): clustering pararmeter (low \(\alpha=\) squished/spiked, high \(\alpha=\) scattering/uniform)

1. Extract the parameter \(\alpha\), \(\sigma\) and \(\epsilon\) from \(X\) and \(U\) ~ these are statistic, they are only the summary number describing data.

# define HK function
extract_hk_unqiue <- function(X, U) {
  # W = Log-Variance
  W_val <- U - X^2
  W_val[W_val < 0.01] <- 0.01 
  
  # ALPHA HK (Scatterer Density)
  alpha_hk <- 1 / W_val
  alpha_hk <- pmax(0.1, pmin(20, alpha_hk))
  
  # SIGMA HK (Scale / Diffuse Power)
  # Scaled by 1,000,000 for numeric stability
  sigma_sq_scaled <- (exp(X + 0.57721) / 2) / 1e6
  sigma_hk <- sqrt(sigma_sq_scaled)
  
  # EPSILON HK (Coherent Strength)
  epsilon_hk <- sigma_hk * 0.1
  
  return(data.frame(AlphaHK = alpha_hk, SigmaHK = sigma_hk, EpsilonHK = epsilon_hk))
}

# apply to tumor peri
peri_hk <- extract_hk_unqiue(master$X_Peri, master$U_Peri)
master$AlphaHK_Peri   <- peri_hk$AlphaHK
master$SigmaHK_Peri   <- peri_hk$SigmaHK
master$EpsilonHK_Peri <- peri_hk$EpsilonHK

# apply to intra peri
intra_hk <- extract_hk_unqiue(master$X_Intra, master$U_Intra)
master$AlphaHK_Intra   <- intra_hk$AlphaHK
master$SigmaHK_Intra   <- intra_hk$SigmaHK
master$EpsilonHK_Intra <- intra_hk$EpsilonHK
head(master)
# A tibble: 6 × 31
  Lesion ID    Scan  Class BIRADS Mu_Intra Omega_Intra Mu_Peri Omega_Peri
   <dbl> <fct> <chr> <fct> <fct>     <dbl>       <dbl>   <dbl>      <dbl>
1      1 1mm   rf1   0     3         0.556     477917.   0.373   4652261.
2      1 1mm   rf2   0     3         0.779     215309.   0.331   4884297.
3      2 2mm   rf1   0     3         0.652     256974.   0.341   8680795.
4      2 2mm   rf2   0     3         0.725     165450.   0.448   3775708.
5      3 3als  rf1   0     4a        0.731     161513.   0.379   2919763.
6      3 3als  rf2   0     4a        0.588     289491.   0.383   7814134.
# ℹ 22 more variables: Alpha_Intra <dbl>, Theta_Intra <dbl>, Alpha_Peri <dbl>,
#   Theta_Peri <dbl>, X_Intra <dbl>, U_Intra <dbl>, X_Peri <dbl>, U_Peri <dbl>,
#   Contrast_Intra <dbl>, Contrast_Peri <dbl>, Correlation_Intra <dbl>,
#   Correlation_Peri <dbl>, Energy_Intra <dbl>, Energy_Peri <dbl>,
#   Homogeneity_Intra <dbl>, Homogeneity_Peri <dbl>, AlphaHK_Peri <dbl>,
#   SigmaHK_Peri <dbl>, EpsilonHK_Peri <dbl>, AlphaHK_Intra <dbl>,
#   SigmaHK_Intra <dbl>, EpsilonHK_Intra <dbl>

2. Generate the Curves & Cutoffs

# Formula: Amplitude = sqrt(2 * Sigma^2 + Epsilon^2)

# benign amplitudes
benign_data <- master[master$Class == 0, ]
benign_amps <- sqrt(2 * (benign_data$SigmaHK_Peri^2) + (benign_data$EpsilonHK_Peri^2))

# malignant amplitudes
malignant_data <- master[master$Class == 1, ]
malignant_amps <- sqrt(2 * (malignant_data$SigmaHK_Peri^2) + (malignant_data$EpsilonHK_Peri^2))

# 95% rejection
rejection_threshold <- quantile(benign_amps, 0.95, na.rm = TRUE)


# How many malignant cases fall into that top 5% "Rejection Region"?
number_found <- sum(malignant_amps > rejection_threshold, na.rm = TRUE)
proportion_rejection <- (number_found / nrow(malignant_data)) * 100

#plot
library(ggplot2)
plot_data <- data.frame(
  Amplitude = c(benign_amps, malignant_amps),
  Group = c(rep("Benign", length(benign_amps)), rep("Malignant", length(malignant_amps)))
)

ggplot(plot_data, aes(x = Amplitude, fill = Group)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = rejection_threshold, color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "HK Parameter-Based Rejection Analysis",
       subtitle = paste("Rejection Proportion:", round(proportion_rejection, 2), "%"),
       x = "Calculated HK Amplitude", y = "Density") +
  theme_minimal()

Not much improve compare to Nakagami or Gamma either.

I tested all three physical models (Gamma, Nakagami, HK) using standard 95% rejection regions. All three yielded detection rates below 5%. This is because the malignant tumors in this dataset exhibit lower variance and amplitude than benign cases, causing the ‘upper-tail’ test to miss them. Therefore, a multi-variable Machine Learning approach is required to capture these inverted patterns since applying one parameter at a time may not capture cancer proportion well.

3. KS Bivariate Test

For a given feature (say U_Peri), each lesion should contribute one 2D observation: \(\text{U_Peri}_{rf1},\text{U_Peri}_{rf2}\).

So first we have to reshape from “long” (two rows per lesion) to “wide” (one row per lesion with two columns). Instead of looking at just one value \(X\) (like we did before with KS test), we look at pair \[X_{rf1},X_{rf2}\]. So every lesion becomes a point in 2D space.

Then we run a multivariate test statistics to produce p-values.

Interpretation:

  • Small p-value \(\rightarrow\) benign vs malignant look different in the 2D space.

  • Large p-value \(\rightarrow\) benign vs malignant have similar distributions, the feature probably cannot distinguish cancer.

We will use example plots to visualize this difference (or similarity)

# ============================================================
# BIVARIATE (rf1, rf2) TWO-SAMPLE TEST PER FEATURE (per lesion)
# Uses Lesion as the pairing key and energy distance test
# ============================================================

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(energy)

# 0) Quick checks
#------------------------------------------------------

# Confirm scans are rf1/rf2
unique(master$Scan)
[1] "rf1" "rf2"
# Confirm most lesions have 2 rows (rf1 + rf2); identify any odd ones
lesion_row_counts <- master %>% count(Lesion)
lesion_row_counts %>% count(n) #ok
Storing counts in `nn`, as `n` already present in input
ℹ Use `name = "new_name"` to pick a new name.
# A tibble: 1 × 2
      n    nn
  <int> <int>
1     2   100
# 1) Pair rf1/rf2 per lesion for a given feature (wide format)
# ----------------------------------------------------

pair_feature <- function(df, feature, key = "Lesion") {
  df %>%
    distinct() %>%
    select(all_of(key), Class, Scan, all_of(feature)) %>%
    mutate(Scan = as.character(Scan)) %>%
    group_by(across(all_of(c(key, "Scan")))) %>%  
    slice(1) %>%
    ungroup() %>%
    pivot_wider(
      names_from  = Scan,
      values_from = all_of(feature),
      names_prefix = paste0(feature, "_")
    ) %>%
    filter(!is.na(.data[[paste0(feature, "_rf1")]]),
           !is.na(.data[[paste0(feature, "_rf2")]]))
}

# 2) Run a bivariate two-sample test for one feature
# ------------------------------------------------------

run_bivar_test <- function(df, feature, key = "Lesion", R_perm = 999) {
  paired <- pair_feature(df, feature, key = key)
  X0 <- paired %>%
    filter(Class == 0) %>%
    select(ends_with("_rf1"), ends_with("_rf2")) %>%
    as.matrix()

  X1 <- paired %>%
    filter(Class == 1) %>%
    select(ends_with("_rf1"), ends_with("_rf2")) %>%
    as.matrix()

  # if something goes wrong (like too few cases), guard
  if (nrow(X0) < 5 || nrow(X1) < 5) {
    return(tibble(
      Feature = feature, n0 = nrow(X0), n1 = nrow(X1),
      statistic = NA_real_, p_value = NA_real_,
      note = "Too few paired lesions in one class"
    ))
  }

  # Energy distance two-sample test (permutation p-value)
  X <- rbind(X0, X1)
  et <- energy::eqdist.etest(X, sizes = c(nrow(X0), nrow(X1)), R = R_perm)

  tibble(
    Feature = feature,
    n0 = nrow(X0),
    n1 = nrow(X1),
    statistic = unname(et$statistic),
    p_value = unname(et$p.value),
    note = ""
  )
}


# 3) Choose your feature list
# ---------------------------

features_to_test <- c(
  # physics / distribution parameters
  "Mu_Intra", "Mu_Peri",
  "Omega_Intra", "Omega_Peri",
  "Alpha_Intra", "Alpha_Peri",
  "Theta_Intra", "Theta_Peri",
  "X_Intra", "X_Peri",
  "U_Intra", "U_Peri",

  # texture features
  "Contrast_Intra", "Contrast_Peri",
  "Correlation_Intra", "Correlation_Peri",
  "Energy_Intra", "Energy_Peri",
  "Homogeneity_Intra", "Homogeneity_Peri",

  # HK features
  "AlphaHK_Intra", "AlphaHK_Peri",
  "SigmaHK_Intra", "SigmaHK_Peri",
  "EpsilonHK_Intra", "EpsilonHK_Peri"
)

# Keep only features that actually exist in master
features_to_test <- features_to_test[features_to_test %in% names(master)]

# 4) Run all tests + summarize
# --------------------------------------------

set.seed(123)
bivar_results <- map_dfr(features_to_test, ~run_bivar_test(master, .x, key = "Lesion", R_perm = 999)) %>%
  arrange(p_value)
bivar_results 
# A tibble: 26 × 6
   Feature              n0    n1    statistic p_value note 
   <chr>             <int> <int>        <dbl>   <dbl> <chr>
 1 Omega_Peri           48    52 64552626.      0.001 ""   
 2 Theta_Peri           48    52    13329.      0.001 ""   
 3 X_Intra              48    52       12.4     0.001 ""   
 4 X_Peri               48    52       18.0     0.001 ""   
 5 U_Intra              48    52      299.      0.001 ""   
 6 U_Peri               48    52      474.      0.001 ""   
 7 SigmaHK_Peri         48    52        6.06    0.001 ""   
 8 EpsilonHK_Peri       48    52        0.606   0.001 ""   
 9 EpsilonHK_Intra      48    52        0.243   0.003 ""   
10 Correlation_Intra    48    52        0.433   0.004 ""   
# ℹ 16 more rows

From the table, strongest features are: Omega_Peri, Theta_Peri, X_Intra, X_Peri, U_Intra, U_Peri, `SigmaHK_Peri. All have p-values <= 0.001. This strongly suggest that benign and malignant have different joint distribution for these features.

Correlation_Intra, AlphaHK_Peri, AlphaHK_Peri, Omega_Intra, AlphaHK_Intra are still statistically significant, but weaker.

These features exhibit strong discriminatory potential and were considered candidates for subsequent modeling.

Next, we take a look at the scatter plot of some of the features reveal whether the bivariate separation exists

# 5) Quick bivariate plot for any one feature
# ---------------------------

plot_bivariate_feature <- function(df, feature, key = "Lesion") {
  paired <- pair_feature(df, feature, key = key)
  x1 <- paste0(feature, "_rf1")
  x2 <- paste0(feature, "_rf2")

  ggplot(paired, aes(x = .data[[x1]], y = .data[[x2]], color = factor(Class))) +
    geom_point(alpha = 0.8) +
    labs(
      title = paste0("Bivariate view: ", feature, " (rf1 vs rf2)"),
      x = paste0(feature, " rf1"),
      y = paste0(feature, " rf2"),
      color = "Class"
    ) +
    theme_minimal()
}

# Example:
plot_bivariate_feature(master, "U_Peri")

plot_bivariate_feature(master, "Mu_Intra")

  • U_Peri (p-value: 0.001) is statistically significant. The scatter plot shows somewhat two separate regions between benign and malignant.

  • Mu_Intra (p-value: 0.327) is not statistically significant. The two classes, benign and malignant, are overlap everywhere.

III. Predictive Models

1. Group Split and Normalization.

To prevent data from accidentally “memorizing” characteristics of the features, we perform the split first before normalizing dataset.

Since this is a small dataset, we use K-Fold Cross-Validation. It is more reliable than training/testing.

Why we use grouped K-fold CV?

Each lesion has repeated observations, such as rf1 and rf2. These are related. If one scan goes into the training set and the other goes into the test set, the model may get an unfair advantage because it has already seen something extremely similar. Thus result in data leakage.

set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v=10) # create the fold. 
#grouped CV prevent data leakage  by forcing all rows from the same lesion to stay in the same fold.
#we group by ID
# 100 lesion, each with 2 scans = 200 rows,we put 10 lesions (20 rows) aside and then use 
# the remaining 90 lesions (180 rows) to train, then it tests on 10 lesions above. 
#The process repeats 10 times and finally average the accuracy across all 10 tries.

# Normalizing

lesion_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(ID, Scan, BIRADS, new_role = "ID") %>% step_zv(all_predictors()) %>% step_normalize(all_numeric_predictors()) 
# only numeric variables are normalized; features are not normalize yet, since I dont want it calculate mean, variance now and might "memorize" the global distribution. 

#recipe() and prep(): estimated on training fold only
#applied to test fold

Normalization parameters are estimated within each training fold during cross-validation to avoid information leakage from the test data.

Here is what happens within cross validation process

For fold 1 (1 fold = 10 lesions = 20 rows):

  • use 9 folds as training data

  • use 1 fold as test data

  • estimate normalization values from the training part only

  • apply those training-based values to both train and test

  • fit the model on the training data

  • predict on the test fold

  • record metrics such as accuracy, ROC AUC, sensitivity, specificity

Then repeat for fold 2, fold 3, …, fold 10.

2. Logistics Regression

# ===============================================
# LOGISTIC REGRESSION
# ==============================================

library(tidyverse)
library(tidymodels)

# 1. Reload the data fresh
master <- read_csv("Master_Feature_Table.csv")
Rows: 200 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): ID, Scan, BIRADS
dbl (22): Lesion, Class, Mu_Intra, Omega_Intra, Mu_Peri, Omega_Peri, Alpha_I...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2. Correct data types
# Make Malignant the FIRST level so sens/spec use Malignant as the positive class
master <- master %>%
  mutate(
    Lesion = factor(Lesion),
    ID     = factor(ID),
    Scan   = factor(Scan),
    BIRADS = factor(BIRADS),
    Class  = factor(as.character(Class),
                    levels = c("1", "0"),
                    labels = c("Malignant", "Benign"))
  )

# Check outcome coding
levels(master$Class)
[1] "Malignant" "Benign"   
table(master$Class, useNA = "ifany")

Malignant    Benign 
      104        96 
# 3. Grouped 10-fold cross-validation
# Keep rf1 and rf2 from the same lesion in the same fold
set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v = 10)

# 4. Preprocessing recipe
# Exclude Lesion, ID, Scan, and BIRADS from predictors
# Remove zero-variance predictors
# Normalize numeric predictors inside each training fold only
lesion_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 5. Define logistic regression model
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# 6. Combine recipe + model into a workflow
log_wf <- workflow() %>%
  add_recipe(lesion_recipe) %>%
  add_model(log_spec)

# 7. Performance metrics
# Since Malignant is the first level, sens/spec are for Malignant
class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# 8. Run grouped 10-fold cross-validation
set.seed(123)
log_res <- fit_resamples(
  log_wf,
  resamples = lesion_folds,
  metrics = class_metrics,
  control = control_resamples(save_pred = TRUE)
)

# 9. Average performance across the 10 folds
collect_metrics(log_res)
# A tibble: 4 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.67     10  0.0309 pre0_mod0_post0
2 roc_auc  binary     0.727    10  0.0314 pre0_mod0_post0
3 sens     binary     0.664    10  0.0545 pre0_mod0_post0
4 spec     binary     0.700    10  0.0334 pre0_mod0_post0
  • Accuracy = 0.670. About 67.0% of lesions were classified correctly on average across the 10 folds.

  • ROC AUC = 0.727. This is the best overall summary here. An AUC of about 0.73 means the model has moderate discrimination. It is clearly better than random guessing, but not strong.

  • Sensitivity = 0.664. Since Malignant is now the positive class, this means the model correctly detects about 66.4% of malignant lesions.

  • Specificity = 0.700. The model correctly identifies about 70.0% of benign lesions.

So the model is reasonably balanced, but it still misses about one-third of malignant cases, which is important in a cancer-classification setting.

3. Support Vector Machine (SVM)

Data likely have non-linear boundaries and SVM also often works well with small/medium dataset.

# ============================================================
# SUPPORT VECTOR MACHINE (SVM) WITH GROUPED 10-FOLD CV
# ============================================================

library(tidyverse)
library(tidymodels)


# 1. Reload the data fresh
master <- read_csv("Master_Feature_Table.csv", show_col_types = FALSE)

# 2. Correct data types
# Make Malignant the positive class (first level)
master <- master %>%
  mutate(
    Lesion = factor(Lesion),
    ID     = factor(ID),
    Scan   = factor(Scan),
    BIRADS = factor(BIRADS),
    Class  = factor(as.character(Class),
                    levels = c("1", "0"),
                    labels = c("Malignant", "Benign"))
  )

# Check coding
levels(master$Class)
[1] "Malignant" "Benign"   
table(master$Class, useNA = "ifany")

Malignant    Benign 
      104        96 
# 3. Grouped 10-fold cross-validation
# Keeps rf1 and rf2 from the same lesion in the same fold
set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v = 10)

# 4. Preprocessing recipe
# Exclude Lesion, ID, Scan, and BIRADS from predictors
# Remove zero-variance predictors
# Normalize numeric predictors
lesion_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# 5. Define SVM model
# We tune both cost and rbf_sigma
svm_spec <- svm_rbf(
  cost = tune(),
  rbf_sigma = tune()
) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

# 6. Combine recipe + model into a workflow
svm_wf <- workflow() %>%
  add_recipe(lesion_recipe) %>%
  add_model(svm_spec)

# 7. Performance metrics
class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# 8. Create a tuning grid
# You can increase levels later if you want a larger search
svm_grid <- grid_regular(
  cost(),
  rbf_sigma(),
  levels = 5
)

svm_grid
# A tibble: 25 × 2
        cost    rbf_sigma
       <dbl>        <dbl>
 1  0.000977 0.0000000001
 2  0.0131   0.0000000001
 3  0.177    0.0000000001
 4  2.38     0.0000000001
 5 32        0.0000000001
 6  0.000977 0.0000000316
 7  0.0131   0.0000000316
 8  0.177    0.0000000316
 9  2.38     0.0000000316
10 32        0.0000000316
# ℹ 15 more rows
# 9. Tune the SVM using grouped 10-fold CV
set.seed(123)
svm_tune_res <- tune_grid(
  svm_wf,
  resamples = lesion_folds,
  grid = svm_grid,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)
maximum number of iterations reached 1.93728e-05 1.937278e-05maximum number of iterations reached 3.447101e-05 3.447096e-05maximum number of iterations reached 0.0004446822 0.0004446709maximum number of iterations reached 0.003473281 0.003471796maximum number of iterations reached 5.490815e-05 5.490803e-05maximum number of iterations reached 0.0006631877 0.0006631417maximum number of iterations reached 0.00356196 0.003576482maximum number of iterations reached 0.003333503 0.003334978maximum number of iterations reached 0.0004911366 0.0004908758maximum number of iterations reached 0.00167486 0.001678944maximum number of iterations reached 2.542258e-05 2.542255e-05maximum number of iterations reached 3.65625e-05 3.656239e-05maximum number of iterations reached 0.0004558288 0.0004558093maximum number of iterations reached 0.004035263 0.004035291maximum number of iterations reached 6.197289e-05 6.197259e-05maximum number of iterations reached 0.0007795895 0.0007795332maximum number of iterations reached 0.004197448 0.004198777maximum number of iterations reached 0.001965818 0.001967128maximum number of iterations reached 0.0004750443 0.0004747852maximum number of iterations reached 0.001636261 0.001638206maximum number of iterations reached 2.395927e-05 2.395923e-05maximum number of iterations reached 3.469817e-05 3.469804e-05maximum number of iterations reached 0.000442619 0.000442605maximum number of iterations reached 0.003673404 0.003669484maximum number of iterations reached 6.021588e-05 6.021564e-05maximum number of iterations reached 0.0007415351 0.0007414804maximum number of iterations reached 0.004347108 0.004347015maximum number of iterations reached 0.001566566 0.001569173maximum number of iterations reached 0.000414834 0.00041461maximum number of iterations reached 0.0009182723 0.0009116822maximum number of iterations reached 4.39388e-05 -5.695755e-11maximum number of iterations reached 7.628034e-05 4.950595e-11maximum number of iterations reached 0.001021588 7.685148e-09maximum number of iterations reached 0.005310394 5.246079e-06maximum number of iterations reached 0.0001169899 6.948964e-11maximum number of iterations reached 0.001824876 3.253998e-08maximum number of iterations reached 0.001631367 3.833234e-06maximum number of iterations reached 0.0002743618 3.663407e-08maximum number of iterations reached 0.0007189497 4.894897e-08maximum number of iterations reached 8.393136e-05 1.491958e-07maximum number of iterations reached 1.963565e-05 1.963562e-05maximum number of iterations reached 3.195017e-05 3.195009e-05maximum number of iterations reached 0.0004761357 0.0004761195maximum number of iterations reached 0.004616132 0.004614161maximum number of iterations reached 6.414473e-05 6.414441e-05maximum number of iterations reached 0.0007918825 0.0007918196maximum number of iterations reached 0.00432163 0.004332861maximum number of iterations reached 0.001981041 0.001983023maximum number of iterations reached 0.0006215059 0.0006211636maximum number of iterations reached 0.001763227 0.001761619maximum number of iterations reached 2.521045e-05 2.521042e-05maximum number of iterations reached 3.565756e-05 3.565745e-05maximum number of iterations reached 0.0004690722 0.0004690529maximum number of iterations reached 0.00447277 0.004466601maximum number of iterations reached 6.083459e-05 6.083427e-05maximum number of iterations reached 0.0008359464 0.0008358743maximum number of iterations reached 0.002430696 0.002444047maximum number of iterations reached 0.002123495 0.002123861maximum number of iterations reached 0.0005902443 0.0005899266maximum number of iterations reached 0.001873214 0.001850608maximum number of iterations reached 1.934658e-05 1.934655e-05maximum number of iterations reached 2.495458e-05 2.495455e-05maximum number of iterations reached 0.000381903 0.0003819024maximum number of iterations reached 0.003978042 0.003973916maximum number of iterations reached 4.519748e-05 4.519744e-05maximum number of iterations reached 0.0006037736 0.0006037594maximum number of iterations reached 0.004686471 0.004677882maximum number of iterations reached 0.003346499 0.003356669maximum number of iterations reached 0.0001654731 0.0001653866maximum number of iterations reached 0.0008264732 0.000818263maximum number of iterations reached 2.34685e-05 2.346847e-05maximum number of iterations reached 3.915876e-05 3.915859e-05maximum number of iterations reached 0.0004475298 0.0004474984maximum number of iterations reached 0.004380399 0.004377836maximum number of iterations reached 6.411736e-05 6.411681e-05maximum number of iterations reached 0.0008693925 0.0008692883maximum number of iterations reached 0.002609221 0.002612473maximum number of iterations reached 0.001217698 0.001213602maximum number of iterations reached 0.0004316407 0.0004314013maximum number of iterations reached 0.001522571 0.001528107maximum number of iterations reached 2.225618e-05 2.225615e-05maximum number of iterations reached 2.970324e-05 2.970315e-05maximum number of iterations reached 0.0003693966 0.0003693788maximum number of iterations reached 0.003097201 0.003097655maximum number of iterations reached 5.621161e-05 5.621128e-05maximum number of iterations reached 0.0007840748 0.0007840394maximum number of iterations reached 0.004550906 0.004551954maximum number of iterations reached 0.002394355 0.002390698maximum number of iterations reached 0.0003991924 0.0003989734maximum number of iterations reached 0.003063472 0.003079001maximum number of iterations reached 1.778816e-05 1.778813e-05maximum number of iterations reached 2.007336e-05 2.007328e-05maximum number of iterations reached 0.0003925202 0.0003924999maximum number of iterations reached 0.004263793 0.004261321maximum number of iterations reached 5.128423e-05 5.128388e-05maximum number of iterations reached 0.0006485365 0.000648482maximum number of iterations reached 0.002473645 0.002476248maximum number of iterations reached 0.001303802 0.001303905maximum number of iterations reached 0.0001480462 0.0001479634maximum number of iterations reached 0.00292133 0.002907439
# 10. View all tuning results
collect_metrics(svm_tune_res)
# A tibble: 100 × 8
       cost    rbf_sigma .metric  .estimator   mean     n std_err .config       
      <dbl>        <dbl> <chr>    <chr>       <dbl> <int>   <dbl> <chr>         
 1 0.000977 0.0000000001 accuracy binary     0.505     10  0.0263 pre0_mod01_po…
 2 0.000977 0.0000000001 roc_auc  binary     0.714     10  0.0410 pre0_mod01_po…
 3 0.000977 0.0000000001 sens     binary     0.936     10  0.0643 pre0_mod01_po…
 4 0.000977 0.0000000001 spec     binary     0.1       10  0.1    pre0_mod01_po…
 5 0.000977 0.0000000316 accuracy binary     0.51      10  0.0277 pre0_mod02_po…
 6 0.000977 0.0000000316 roc_auc  binary     0.719     10  0.0373 pre0_mod02_po…
 7 0.000977 0.0000000316 sens     binary     0.971     10  0.0286 pre0_mod02_po…
 8 0.000977 0.0000000316 spec     binary     0.0333    10  0.0333 pre0_mod02_po…
 9 0.000977 0.00001      accuracy binary     0.51      10  0.0277 pre0_mod03_po…
10 0.000977 0.00001      roc_auc  binary     0.745     10  0.0299 pre0_mod03_po…
# ℹ 90 more rows
# 11. Show the best SVM setting based on ROC AUC
show_best(svm_tune_res, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
       cost rbf_sigma .metric .estimator  mean     n std_err .config         
      <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
1 32          0.00316 roc_auc binary     0.791    10  0.0244 pre0_mod24_post0
2  2.38       0.00316 roc_auc binary     0.781    10  0.0228 pre0_mod19_post0
3  0.0131     0.00316 roc_auc binary     0.752    10  0.0330 pre0_mod09_post0
4  0.177      0.00316 roc_auc binary     0.752    10  0.0330 pre0_mod14_post0
5  0.000977   0.00316 roc_auc binary     0.752    10  0.0322 pre0_mod04_post0
# 12. Select the single best parameter combination
best_svm <- select_best(svm_tune_res, metric = "roc_auc")
best_svm
# A tibble: 1 × 3
   cost rbf_sigma .config         
  <dbl>     <dbl> <chr>           
1    32   0.00316 pre0_mod24_post0
# 13. Get the cross-validated metrics for that best SVM
best_svm_metrics <- collect_metrics(svm_tune_res) %>%
  inner_join(best_svm, by = c("cost", "rbf_sigma"))

best_svm_metrics
# A tibble: 4 × 9
   cost rbf_sigma .metric  .estimator  mean     n std_err .config.x    .config.y
  <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>        <chr>    
1    32   0.00316 accuracy binary     0.69     10  0.0277 pre0_mod24_… pre0_mod…
2    32   0.00316 roc_auc  binary     0.791    10  0.0244 pre0_mod24_… pre0_mod…
3    32   0.00316 sens     binary     0.641    10  0.0413 pre0_mod24_… pre0_mod…
4    32   0.00316 spec     binary     0.753    10  0.0382 pre0_mod24_… pre0_mod…

Compared with logistic regression:

  • Accuracy improved from 0.670 to 0.690.

  • ROC AUC improved from 0.727 to 0.791

  • Sensitivity dropped slightly from 0.664 to 0.641.

  • Specificity improved from 0.700 to 0.753

So the SVM is doing a better job overall at separating benign from malignant lesions, especially in terms of ROC AUC, which is usually the most informative summary metric here. It is also better at correctly identifying benign lesions, though it gives up a little sensitivity for malignant detection.

Compared with logistic regression: Accuracy improved from 0.670 to 0.690 ROC AUC improved from 0.727 to 0.791 Sensitivity dropped slightly from 0.664 to 0.641 Specificity improved from 0.700 to 0.753 So the SVM is doing a better job overall at separating benign from malignant lesions, especially in terms of ROC AUC, which is usually the most informative summary metric here. It is also better at correctly identifying benign lesions, though it gives up a little sensitivity for malignant detection.

Thus, Logistic regression is the good baseline but SVM is the better overall model so far.

4. Decision Tree

# ===============================================
# DECISION TREE WITH GROUPED 10-FOLD CV
# ===============================================

library(tidyverse)
library(tidymodels)
library(rpart.plot)
Loading required package: rpart

Attaching package: 'rpart'
The following object is masked from 'package:dials':

    prune
# 1. Reload the data fresh
master <- read_csv("Master_Feature_Table.csv", show_col_types = FALSE)

# 2. Correct data types
# Make Malignant the positive class (first level)
master <- master %>%
  mutate(
    Lesion = factor(Lesion),
    ID     = factor(ID),
    Scan   = factor(Scan),
    BIRADS = factor(BIRADS),
    Class  = factor(as.character(Class),
                    levels = c("1", "0"),
                    labels = c("Malignant", "Benign"))
  )

# Check coding
levels(master$Class)
[1] "Malignant" "Benign"   
table(master$Class, useNA = "ifany")

Malignant    Benign 
      104        96 
# 3. Grouped 10-fold cross-validation
set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v = 10)

# 4. Recipe
# Tree models do not require normalization
tree_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors())

# 5. Define decision tree model
# Tune cost_complexity, tree_depth, and min_n
tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# 6. Workflow
tree_wf <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(tree_spec)

# 7. Metrics
class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# 8. Tuning grid
tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(range = c(2L, 20L)),
  levels = 4
)

tree_grid
# A tibble: 64 × 3
   cost_complexity tree_depth min_n
             <dbl>      <int> <int>
 1    0.0000000001          1     2
 2    0.0000001             1     2
 3    0.0001                1     2
 4    0.1                   1     2
 5    0.0000000001          5     2
 6    0.0000001             5     2
 7    0.0001                5     2
 8    0.1                   5     2
 9    0.0000000001         10     2
10    0.0000001            10     2
# ℹ 54 more rows
# 9. Tune decision tree
set.seed(123)
tree_tune_res <- tune_grid(
  tree_wf,
  resamples = lesion_folds,
  grid = tree_grid,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)

# 10. Optional: inspect notes
show_notes(tree_tune_res)
Great job! No notes to show.
# 11. View all tuning results
collect_metrics(tree_tune_res)
# A tibble: 256 × 9
   cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
             <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
 1    0.0000000001          1     2 accuracy binary     0.635    10  0.0183
 2    0.0000000001          1     2 roc_auc  binary     0.647    10  0.0220
 3    0.0000000001          1     2 sens     binary     0.455    10  0.0570
 4    0.0000000001          1     2 spec     binary     0.839    10  0.0563
 5    0.0000000001          1     8 accuracy binary     0.635    10  0.0183
 6    0.0000000001          1     8 roc_auc  binary     0.647    10  0.0220
 7    0.0000000001          1     8 sens     binary     0.455    10  0.0570
 8    0.0000000001          1     8 spec     binary     0.839    10  0.0563
 9    0.0000000001          1    14 accuracy binary     0.635    10  0.0183
10    0.0000000001          1    14 roc_auc  binary     0.647    10  0.0220
# ℹ 246 more rows
# ℹ 1 more variable: .config <chr>
# 12. Best models by ROC AUC
show_best(tree_tune_res, metric = "roc_auc", n = 5)
# A tibble: 5 × 9
  cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
            <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
1    0.0000000001         10    20 roc_auc binary     0.718    10  0.0395
2    0.0000000001         15    20 roc_auc binary     0.718    10  0.0395
3    0.0000001            10    20 roc_auc binary     0.718    10  0.0395
4    0.0000001            15    20 roc_auc binary     0.718    10  0.0395
5    0.0001               10    20 roc_auc binary     0.718    10  0.0395
# ℹ 1 more variable: .config <chr>
# 13. Select the best parameter combination
best_tree <- select_best(tree_tune_res, metric = "roc_auc")
best_tree
# A tibble: 1 × 4
  cost_complexity tree_depth min_n .config         
            <dbl>      <int> <int> <chr>           
1    0.0000000001         10    20 pre0_mod12_post0
# 14. Metrics for the best decision tree
best_tree_metrics <- collect_metrics(tree_tune_res) %>%
  inner_join(best_tree, by = c("cost_complexity", "tree_depth", "min_n"))

best_tree_metrics
# A tibble: 4 × 10
  cost_complexity tree_depth min_n .metric  .estimator  mean     n std_err
            <dbl>      <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
1    0.0000000001         10    20 accuracy binary     0.67     10  0.0335
2    0.0000000001         10    20 roc_auc  binary     0.718    10  0.0395
3    0.0000000001         10    20 sens     binary     0.671    10  0.0468
4    0.0000000001         10    20 spec     binary     0.672    10  0.0448
# ℹ 2 more variables: .config.x <chr>, .config.y <chr>
# 15. Finalize workflow with best parameters
final_tree_wf <- finalize_workflow(tree_wf, best_tree)

# 16. Fit finalized tree to full data
final_tree_fit <- fit(final_tree_wf, data = master)

# 17. Plot the final tree
final_tree_obj <- extract_fit_parsnip(final_tree_fit)$fit

rpart.plot(
  final_tree_obj,
  type = 2,
  extra = 104,
  fallen.leaves = TRUE,
  box.palette = "Blues",
  shadow.col = "gray",
  nn = TRUE,
  main = "Final Decision Tree for Lesion Classification"
)
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
    Call rpart.plot with roundint=FALSE,
    or rebuild the rpart model with model=TRUE.

# 18. Optional: tuning plot
collect_metrics(tree_tune_res) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(x = tree_depth, y = mean, color = factor(min_n))) +
  geom_line() +
  geom_point(size = 2) +
  facet_wrap(~ cost_complexity, scales = "free_x") +
  theme_minimal() +
  labs(
    title = "Decision Tree Tuning Results",
    x = "Tree Depth",
    y = "Mean ROC AUC",
    color = "min_n"
  )

  • The best ROC AUC is only around 0.72.

  • Very large cost_complexity (0.1) collapses the tree too much, which is why that panel looks almost flat and weaker.

So the tuning plot suggests: A single decision tree has only moderate predictive ability and does not match the performance of the more flexible ensemble methods.

As for the tree: peri-tumoral features are very influential, some intra-tumoral features also matter, both distributional and texture features contribute.

The decision tree identified Omega_Peri as the primary splitting variable, indicating that peri-tumoral intensity-related information plays a major role in lesion classification. Subsequent splits involved U_Peri, Alpha_Intra, Correlation_Peri, Theta_Intra, and Contrast_Peri.

5. Random Forest

# ============================================================
# RANDOM FOREST WITH GROUPED 10-FOLD CV
# ============================================================

library(tidyverse)
library(tidymodels)
library(ranger)

# Reload data
master <- read_csv("Master_Feature_Table.csv", show_col_types = FALSE)

# Correct data types
master <- master %>%
  mutate(
    Lesion = factor(Lesion),
    ID     = factor(ID),
    Scan   = factor(Scan),
    BIRADS = factor(BIRADS),
    Class  = factor(as.character(Class),
                    levels = c("1", "0"),
                    labels = c("Malignant", "Benign"))
  )

# Grouped 10-fold CV
set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v = 10)

# Recipe
rf_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors())

# Random forest model
rf_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 1000
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Workflow
rf_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)

# Metrics
class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# Count only actual predictors
p <- master %>%
  select(-Class, -Lesion, -ID, -Scan, -BIRADS) %>%
  ncol()

p
[1] 20
# Tuning grid
rf_grid <- grid_regular(
  mtry(range = c(2L, p)),
  min_n(range = c(2L, 20L)),
  levels = 5
)

# Tune random forest
set.seed(123)
rf_tune_res <- tune_grid(
  rf_wf,
  resamples = lesion_folds,
  grid = rf_grid,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)

# Optional: inspect warnings
show_notes(rf_tune_res)
Great job! No notes to show.
# Best settings
show_best(rf_tune_res, metric = "roc_auc", n = 5)
# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config         
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
1    20     6 roc_auc binary     0.823    10  0.0298 pre0_mod22_post0
2    20    20 roc_auc binary     0.821    10  0.0301 pre0_mod25_post0
3     6    20 roc_auc binary     0.821    10  0.0310 pre0_mod10_post0
4    20     2 roc_auc binary     0.820    10  0.0285 pre0_mod21_post0
5    11     2 roc_auc binary     0.820    10  0.0280 pre0_mod11_post0
best_rf <- select_best(rf_tune_res, metric = "roc_auc")
best_rf
# A tibble: 1 × 3
   mtry min_n .config         
  <int> <int> <chr>           
1    20     6 pre0_mod22_post0
# Metrics for best model
best_rf_metrics <- collect_metrics(rf_tune_res) %>%
  inner_join(best_rf, by = c("mtry", "min_n"))

best_rf_metrics
# A tibble: 4 × 9
   mtry min_n .metric  .estimator  mean     n std_err .config.x        .config.y
  <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>            <chr>    
1    20     6 accuracy binary     0.725    10  0.0227 pre0_mod22_post0 pre0_mod…
2    20     6 roc_auc  binary     0.823    10  0.0298 pre0_mod22_post0 pre0_mod…
3    20     6 sens     binary     0.723    10  0.0392 pre0_mod22_post0 pre0_mod…
4    20     6 spec     binary     0.749    10  0.0520 pre0_mod22_post0 pre0_mod…

Plot the tuning results

library(ggplot2)

autoplot(rf_tune_res) +
  theme_minimal() +
  labs(title = "Random Forest Tuning Results")

  • ROC AUC is highest when mtry is fairly large, around 18 to 24.

  • The best curves also seem to come from smaller-to-moderate min_n, especially around 6 or 20, depending on the exact metric.

  • Sensitivity and specificity move in opposite directions a bit, which is normal. Some settings catch more malignant lesions, while others do better at recognizing benign lesions.

  • Overall, random forest seems to be performing in the low 0.82 range for ROC AUC, which is better than your logistic model and also a little better than your SVM.

The random forest appears to capture nonlinear structure and feature interactions better than logistic regression and SVM, yielding the strongest overall discrimination so far.

Plot variable importance

# Finalize workflow with best parameters
final_rf_wf <- finalize_workflow(rf_wf, best_rf)

# Fit finalized random forest to the full dataset
final_rf_fit <- fit(final_rf_wf, data = master)

# Extract variable importance from ranger fit
rf_importance <- extract_fit_parsnip(final_rf_fit)$fit$variable.importance

# Convert to tibble
rf_importance_tbl <- tibble(
  Feature = names(rf_importance),
  Importance = as.numeric(rf_importance)
) %>%
  arrange(desc(Importance))

# View importance table
rf_importance_tbl
# A tibble: 20 × 2
   Feature           Importance
   <chr>                  <dbl>
 1 Omega_Peri             16.8 
 2 Alpha_Intra             8.35
 3 Theta_Peri              7.84
 4 U_Peri                  7.50
 5 Theta_Intra             5.82
 6 Correlation_Intra       5.07
 7 Correlation_Peri        4.98
 8 X_Peri                  4.46
 9 Homogeneity_Peri        4.41
10 Omega_Intra             4.36
11 X_Intra                 4.27
12 U_Intra                 3.73
13 Homogeneity_Intra       3.00
14 Energy_Intra            2.64
15 Mu_Intra                2.55
16 Energy_Peri             2.29
17 Alpha_Peri              1.86
18 Mu_Peri                 1.57
19 Contrast_Intra          1.40
20 Contrast_Peri           1.22
# Plot top 10 variable importance
rf_importance_tbl %>%
  slice_max(order_by = Importance, n = 10) %>%
  ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Top 10 Variable Importance Scores",
    x = "Feature",
    y = "Importance"
  )

Peri-tumoral features are especially important, since several top variables come from the peri region.Omega_Peri stands out clearly as the strongest predictor.

Both distribution-based parameters (Omega, Theta, Alpha, U, X) and texture-based measures (Correlation, Homogeneity) matter.This supports your earlier idea that classification improves when the model can combine multiple lesion characteristics instead of relying on one feature at a time.

Which is the best model?

# ============================================================
# FINAL MODEL COMPARISON TABLE
# ============================================================

library(tidyverse)

# 1. Logistic regression results
log_tbl <- collect_metrics(log_res) %>%
  select(.metric, mean) %>%
  mutate(Model = "Logistic Regression")

# 2. SVM results
svm_tbl <- best_svm_metrics %>%
  select(.metric, mean) %>%
  mutate(Model = "SVM (RBF)")

# 3. Decision tree results
tree_tbl <- best_tree_metrics %>%
  select(.metric, mean) %>%
  mutate(Model = "Decision Tree")

# 4. Random forest results
rf_tbl <- best_rf_metrics %>%
  select(.metric, mean) %>%
  mutate(Model = "Random Forest")

# 5. Combine all models
model_comparison <- bind_rows(log_tbl, svm_tbl, tree_tbl, rf_tbl) %>%
  select(Model, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean) %>%
  relocate(Model) %>%
  arrange(desc(roc_auc))

model_comparison %>%
  mutate(
    accuracy = round(accuracy, 2),
    roc_auc  = round(roc_auc, 2),
    sens     = round(sens, 2),
    spec     = round(spec, 2)
  )
# A tibble: 4 × 5
  Model               accuracy roc_auc  sens  spec
  <chr>                  <dbl>   <dbl> <dbl> <dbl>
1 Random Forest           0.72    0.82  0.72  0.75
2 SVM (RBF)               0.69    0.79  0.64  0.75
3 Logistic Regression     0.67    0.73  0.66  0.7 
4 Decision Tree           0.67    0.72  0.67  0.67

LOGISTIC + RANDOM FOREST FOR INTRA / PERI / COMBINED

# ==================================================
# COMPARISON
# LOGISTIC REGRESSION + RANDOM FOREST
# INTRA ONLY vs PERI ONLY vs COMBINED
# ==================================================

library(tidyverse)
library(tidymodels)
library(ranger)

# 1. Reload data
master <- read_csv("Master_Feature_Table.csv", show_col_types = FALSE)

# 2. Correct data types
master <- master %>%
  mutate(
    Lesion = factor(Lesion),
    ID     = factor(ID),
    Scan   = factor(Scan),
    BIRADS = factor(BIRADS),
    Class  = factor(as.character(Class),
                    levels = c("1", "0"),
                    labels = c("Malignant", "Benign"))
  )

# 3. Grouped 10-fold CV
set.seed(123)
lesion_folds <- group_vfold_cv(master, group = Lesion, v = 10)

# 4. Identify feature groups
intra_vars <- names(master)[str_detect(names(master), "_Intra$")]
peri_vars  <- names(master)[str_detect(names(master), "_Peri$")]

# 5. Metrics
class_metrics <- metric_set(roc_auc, accuracy, sens, spec)

# ============================================================
# A. LOGISTIC REGRESSION
# ============================================================

log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# ----------------------------
# Intra-only logistic
# ----------------------------
log_recipe_intra <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_rm(-all_of(c("Class", "Lesion", "ID", "Scan", "BIRADS", intra_vars))) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_wf_intra <- workflow() %>%
  add_recipe(log_recipe_intra) %>%
  add_model(log_spec)

set.seed(123)
log_res_intra <- fit_resamples(
  log_wf_intra,
  resamples = lesion_folds,
  metrics = class_metrics,
  control = control_resamples(save_pred = TRUE)
)
Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
  <https://tidyselect.r-lib.org/reference/faq-selection-context.html>
log_metrics_intra <- collect_metrics(log_res_intra) %>%
  select(.metric, mean) %>%
  mutate(Model = "Logistic", Feature_Set = "Intra only")

# ----------------------------
# Peri-only logistic
# ----------------------------
log_recipe_peri <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_rm(-all_of(c("Class", "Lesion", "ID", "Scan", "BIRADS", peri_vars))) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_wf_peri <- workflow() %>%
  add_recipe(log_recipe_peri) %>%
  add_model(log_spec)

set.seed(123)
log_res_peri <- fit_resamples(
  log_wf_peri,
  resamples = lesion_folds,
  metrics = class_metrics,
  control = control_resamples(save_pred = TRUE)
)

log_metrics_peri <- collect_metrics(log_res_peri) %>%
  select(.metric, mean) %>%
  mutate(Model = "Logistic", Feature_Set = "Peri only")

# ----------------------------
# Combined logistic
# ----------------------------
log_recipe_combined <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_wf_combined <- workflow() %>%
  add_recipe(log_recipe_combined) %>%
  add_model(log_spec)

set.seed(123)
log_res_combined <- fit_resamples(
  log_wf_combined,
  resamples = lesion_folds,
  metrics = class_metrics,
  control = control_resamples(save_pred = TRUE)
)

log_metrics_combined <- collect_metrics(log_res_combined) %>%
  select(.metric, mean) %>%
  mutate(Model = "Logistic", Feature_Set = "Intra + Peri")

# ==================================================
# B. RANDOM FOREST
# ==================================================

rf_spec <- rand_forest(
  mtry = tune(),
  min_n = tune(),
  trees = 1000
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# ----------------------------
# Intra-only random forest
# ----------------------------
rf_recipe_intra <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_rm(-all_of(c("Class", "Lesion", "ID", "Scan", "BIRADS", intra_vars))) %>%
  step_zv(all_predictors())

rf_wf_intra <- workflow() %>%
  add_recipe(rf_recipe_intra) %>%
  add_model(rf_spec)

p_intra <- length(intra_vars)

rf_grid_intra <- grid_regular(
  mtry(range = c(2L, p_intra)),
  min_n(range = c(2L, 20L)),
  levels = 5
)

set.seed(123)
rf_tune_intra <- tune_grid(
  rf_wf_intra,
  resamples = lesion_folds,
  grid = rf_grid_intra,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)

best_rf_intra <- select_best(rf_tune_intra, metric = "roc_auc")

rf_metrics_intra <- collect_metrics(rf_tune_intra) %>%
  inner_join(best_rf_intra, by = c("mtry", "min_n")) %>%
  select(.metric, mean) %>%
  mutate(Model = "Random Forest", Feature_Set = "Intra only")

# ----------------------------
# Peri-only random forest
# ----------------------------
rf_recipe_peri <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_rm(-all_of(c("Class", "Lesion", "ID", "Scan", "BIRADS", peri_vars))) %>%
  step_zv(all_predictors())

rf_wf_peri <- workflow() %>%
  add_recipe(rf_recipe_peri) %>%
  add_model(rf_spec)

p_peri <- length(peri_vars)

rf_grid_peri <- grid_regular(
  mtry(range = c(2L, p_peri)),
  min_n(range = c(2L, 20L)),
  levels = 5
)

set.seed(123)
rf_tune_peri <- tune_grid(
  rf_wf_peri,
  resamples = lesion_folds,
  grid = rf_grid_peri,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)

best_rf_peri <- select_best(rf_tune_peri, metric = "roc_auc")

rf_metrics_peri <- collect_metrics(rf_tune_peri) %>%
  inner_join(best_rf_peri, by = c("mtry", "min_n")) %>%
  select(.metric, mean) %>%
  mutate(Model = "Random Forest", Feature_Set = "Peri only")

# ----------------------------
# Combined random forest
# ----------------------------
rf_recipe_combined <- recipe(Class ~ ., data = master) %>%
  update_role(Lesion, ID, Scan, BIRADS, new_role = "ID") %>%
  step_zv(all_predictors())

rf_wf_combined <- workflow() %>%
  add_recipe(rf_recipe_combined) %>%
  add_model(rf_spec)

p_combined <- master %>%
  select(-Class, -Lesion, -ID, -Scan, -BIRADS) %>%
  ncol()

rf_grid_combined <- grid_regular(
  mtry(range = c(2L, p_combined)),
  min_n(range = c(2L, 20L)),
  levels = 5
)

set.seed(123)
rf_tune_combined <- tune_grid(
  rf_wf_combined,
  resamples = lesion_folds,
  grid = rf_grid_combined,
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)
)

best_rf_combined <- select_best(rf_tune_combined, metric = "roc_auc")

rf_metrics_combined <- collect_metrics(rf_tune_combined) %>%
  inner_join(best_rf_combined, by = c("mtry", "min_n")) %>%
  select(.metric, mean) %>%
  mutate(Model = "Random Forest", Feature_Set = "Intra + Peri")

# ============================================================
# FINAL COMPARISON TABLE
# ============================================================

feature_set_comparison <- bind_rows(
  log_metrics_intra,
  log_metrics_peri,
  log_metrics_combined,
  rf_metrics_intra,
  rf_metrics_peri,
  rf_metrics_combined
) %>%
  select(Model, Feature_Set, .metric, mean) %>%
  pivot_wider(names_from = .metric, values_from = mean) %>%
  mutate(
    accuracy = round(accuracy, 3),
    roc_auc  = round(roc_auc, 3),
    sens     = round(sens, 3),
    spec     = round(spec, 3)
  ) %>%
  arrange(Model, desc(roc_auc))

feature_set_comparison
# A tibble: 6 × 6
  Model         Feature_Set  accuracy roc_auc  sens  spec
  <chr>         <chr>           <dbl>   <dbl> <dbl> <dbl>
1 Logistic      Intra + Peri    0.67    0.727 0.664 0.7  
2 Logistic      Peri only       0.675   0.72  0.659 0.696
3 Logistic      Intra only      0.645   0.712 0.631 0.707
4 Random Forest Intra + Peri    0.725   0.823 0.723 0.749
5 Random Forest Intra only      0.675   0.817 0.688 0.711
6 Random Forest Peri only       0.66    0.711 0.625 0.715

To directly evaluate the contribution of peri-tumoral information, we compared intra-only, peri-only, and combined intra-plus-peri predictor sets using logistic regression and random forest under grouped 10-fold cross-validation.

For logistic regression, the combined feature set achieved the highest ROC AUC (0.727), followed by peri-only (0.720) and intra-only (0.712).

For random forest, the combined feature set again performed best, with ROC AUC of 0.823, compared with 0.817 for intra-only and 0.711 for peri-only.

These results indicate that peri-tumoral information is most useful when combined with intra-tumoral features, rather than used in isolation. Overall, the strongest classification performance was obtained by the random forest model using the combined intra-plus-peri feature set.