OASBUD Lesion Classification.

Author

My (Mimi) Tran

I. Data Wrangling and EDA

Libraries and loading data

# Library
library(tidyverse)
library(tidymodels)
library(dplyr)
library(recipes)

# Load data
master <- read_csv("Master_Feature_Table.csv")
head(master)

# A tibble: 6 × 25
  Lesion ID    Scan  Class BIRADS Mu_Intra Omega_Intra Mu_Peri Omega_Peri
   <dbl> <chr> <chr> <dbl> <chr>     <dbl>       <dbl>   <dbl>      <dbl>
1      1 1mm   rf1       0 3         0.556     477917.   0.373   4652261.
2      1 1mm   rf2       0 3         0.779     215309.   0.331   4884297.
3      2 2mm   rf1       0 3         0.652     256974.   0.341   8680795.
4      2 2mm   rf2       0 3         0.725     165450.   0.448   3775708.
5      3 3als  rf1       0 4a        0.731     161513.   0.379   2919763.
6      3 3als  rf2       0 4a        0.588     289491.   0.383   7814134.
# ℹ 16 more variables: Alpha_Intra <dbl>, Theta_Intra <dbl>, Alpha_Peri <dbl>,
#   Theta_Peri <dbl>, X_Intra <dbl>, U_Intra <dbl>, X_Peri <dbl>, U_Peri <dbl>,
#   Contrast_Intra <dbl>, Contrast_Peri <dbl>, Correlation_Intra <dbl>,
#   Correlation_Peri <dbl>, Energy_Intra <dbl>, Energy_Peri <dbl>,
#   Homogeneity_Intra <dbl>, Homogeneity_Peri <dbl>

Correct data type: `Class` , `ID` and `BIRADS` columns are currently numbers or characters, but for ML in R, they need to be factors(categorical variables).

 master <- master %>% mutate(
   ID = as.factor(ID),
   Class = as.factor(Class),
   BIRADS = as.factor(BIRADS)
 )
 str(master[, c("ID", "BIRADS", "Class")])

tibble [200 × 3] (S3: tbl_df/tbl/data.frame)
 $ ID    : Factor w/ 99 levels "102ks","10jw",..: 37 37 54 54 61 61 70 70 80 80 ...
 $ BIRADS: Factor w/ 5 levels "3","4a","4b",..: 1 1 1 1 2 2 5 5 3 3 ...
 $ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...

Exploratory Data Analysis:

Class check.

First, we need to know how many Benign (0) vs. Malignant (1) cases we have, and how they align with the BIRADS scores.

# count of class 0 and 1
master_count <- master %>% count(Class)
master_count

# A tibble: 2 × 2
  Class     n
  <fct> <int>
1 0        96
2 1       104

# see how does class relate to birads
master_count.birad <- master%>% count(BIRADS, Class)
master_count.birad

# A tibble: 8 × 3
  BIRADS Class     n
  <fct>  <fct> <int>
1 3      0        62
2 4a     0        18
3 4a     1         4
4 4b     0        14
5 4b     1        18
6 4c     1        34
7 5      0         2
8 5      1        48

There are 96 Benign case and 102 Malignant cases. Most of the benign cases ( Class 0) align with lower `BIRADS` and malignant cases (Class 1) align with higher `BIRADS` . However, there will be some challenge in BIRADS 4a and 4b. Most of 4a are benign but some are malignant and most of 4b are malignant but some are benign. This is a good place to use Kolmogorov-Smirnov Test to see whether or not their distributions (4a vs 4b) are distinct.

The anomaly here is there are two cases of benign that have BIRADS 5 ~ this is false positive. We can use Proportion of Rejections test to evaluate this further.

Next, we want to check normality for all the features for each class.

master_long_classed <- master %>%
  select(Class, where(is.numeric)) %>%
  select(-any_of(c("ID", "Lesion", "Scan"))) %>% 
  pivot_longer(cols = -Class, names_to = "feature", values_to = "value")

ggplot(master_long_classed, aes(x = value, fill = factor(Class))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~feature, scales = "free") +
  scale_fill_manual(values = c("0" = "green", "1" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Looking at density plots above, we see that distributions that exhibit strong seperations between the Benign (Green) and Malignant (Red) classes are: distributions: U_Intra, U_Peri, X_Intra, X_Peri, Mu_Intra, Alpha_Intra. This visual gap provides the groundwork for the Kolmogorov-Smirnov (KS) test.

II. Test Statistics

1. Two-Sample Kolmogorov-Smirnov (KS) Test

Before we head to Machine Learning models, we will use the Two-Sample Kolmogorov-Smirnov (KS) Test.

Suppose we have two samples size \(m\) and size \(n\)The Two-Sample Kolmogorov-Smirnov (KS) Test use to test whether or not two samples came from the same distribution. The null hypothesis is \(H_0\): both samples (Class 0 and Class 1) come from a population with the same distribution. The test statistics is D - statistics which has values between 0 and 1. The higher the D the more separated “Green” and “Red” curves are. And ultimately if the p-value < 0.05, we reject \(H_0\) and say there is a statistically significant difference between Benign and Malignant for that feature.

# Create a summary table of KS results for all numeric features
ks_summary <- master %>%
  select(where(is.numeric)) %>%
  select(-any_of(c("Class", "ID", "Scan", "BIRADS", "Lesion"))) %>%
  map_df(~{
    test <- ks.test(.x[master$Class == 0], .x[master$Class == 1])
    data.frame(D_statistic = test$statistic, p_value = test$p.value)
  }, .id = "Feature") %>%
  arrange(desc(D_statistic)) # put the most significant at the top 

print(ks_summary)

                 Feature D_statistic      p_value
D...1         Omega_Peri  0.45753205 5.694600e-10
D...2         Theta_Peri  0.41586538 2.850998e-08
D...3             X_Peri  0.40705128 6.179996e-08
D...4             U_Peri  0.39983974 1.147959e-07
D...5            X_Intra  0.33493590 1.753000e-05
D...6            U_Intra  0.33493590 1.753000e-05
D...7        Omega_Intra  0.30448718 1.335927e-04
D...8        Theta_Intra  0.26362179 1.482570e-03
D...9        Alpha_Intra  0.20112179 2.957408e-02
D...10 Correlation_Intra  0.19471154 3.838945e-02
D...11      Energy_Intra  0.16746795 1.057806e-01
D...12  Correlation_Peri  0.15224359 1.744454e-01
D...13          Mu_Intra  0.14102564 2.444385e-01
D...14           Mu_Peri  0.14022436 2.501625e-01
D...15       Energy_Peri  0.13782051 2.679070e-01
D...16        Alpha_Peri  0.11778846 4.503562e-01
D...17  Homogeneity_Peri  0.10897436 5.481896e-01
D...18 Homogeneity_Intra  0.08173077 8.578663e-01
D...19     Contrast_Peri  0.06009615 9.868820e-01
D...20    Contrast_Intra  0.05929487 9.885425e-01

For those features that are statistically significant, we identify them as the most viable biomarkers for the classification task.

2. Class 0 Reference Distributions and Evaluation of Malignant Rejection Rates

The procedure:

For each feature (Nakagami, Gamma, etc.), we take all Benign scans and turn their parameters into curves (plug the parameters into their distribution’s pdf). Average those curves to create one single distribution.
Once we have the single distribution data we decide where “normal” ends and “non-normal” begins. And how do we do this? we marks the outer 5% of the curve as rejection region. If a lesion falls within this rejection region, it is statistically rejected because it doesn’t look like typical benign lesion.
Next, we test all 104 malignant observations. We look at where their values land on the x-axis relative to single distribution data.

If a malignant lands on that 5% rejection region, the test is a success - it “found” the cancer.

If it lands in the normal region, the test “missed” it.
Finally, we calculate the proportion of rejection. We want to know: What proportion of the malignant cases did the single distribution correctly find as being not benign.

This test will let us know which distribution (Nakagami, Gamma, and HK) is the better tool for this analysis.

Let’s start with Nakagami PDF.

Nakagami PDF:

\(X \sim Nakagami(\mu, \Omega)\)

\[ f(x)=\frac{2m^m}{\Gamma(m)\Omega^m}x^{2m-1}\exp\bigg(-\frac{mx^2}{\Omega}\bigg) \]

#define Nakagami PDF:
nakagami_pdf <- function(x, mu, omega){
  term1 <- (2*mu^mu)/(gamma(mu)*omega^mu)
  term2 <- x^(2*mu-1)
  term3<-exp(-(mu/omega)*x^2)
  return(term1*term2*term3)
}

# filter class 0
class0_data <- master %>% filter(Class==0) 

#head(class0_dat)

#Nakagami table:
# Since both mu features are not statistcally significant, we use the average
avg_mu_intra_0 <- mean(class0_data$Mu_Intra)

avg_mu_peri_0 <- mean(class0_data$Mu_Peri)

class0_nakagami <- class0_data %>% 
  select( M= Mu_Peri, Omega = Omega_Peri)

# Table for INTRA region
class0_nakagami_intra <- class0_data %>%
  select(Omega = Omega_Intra) %>% 
  mutate(Mu = avg_mu_intra_0)

# Table for PERI region
class0_nakagami_peri <- class0_data %>% 
  select(Omega = Omega_Peri) %>% 
  mutate(Mu = avg_mu_peri_0)

 

# Define x-axis: (for pixel: 0=black, to 255 = white)
# checking range of class0_nakagami_peri and class0_nakagami_intra values to get idea of x_axis range
summary(class0_nakagami_peri$Mu)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3908  0.3908  0.3908  0.3908  0.3908  0.3908

summary(class0_nakagami_peri$Omega)# max is roughly 45 millions, Omega is sqrt(intensity) which is 6764.487

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  921804  3537296  5716001  8075811  9036075 45758278

max_intensity <- 7000
x_axis <- seq(1, max_intensity, length.out=2000)

#Peri
peri_pdf <- sapply(1:nrow(class0_nakagami_peri), function(i) {
  nakagami_pdf(x_axis, 
               mu = class0_nakagami_peri$Mu[i], 
               omega = class0_nakagami_peri$Omega[i])
})

#averaging to get a single per
single_peri_pdf <- rowMeans(peri_pdf)
#head(single_peri_pdf)

# Intra
intra_pdf <- sapply(1:nrow(class0_nakagami_intra), function(i) {
  nakagami_pdf(x_axis, 
               mu = class0_nakagami_intra$Mu[i], 
               omega = class0_nakagami_intra$Omega[i])
})
single_intra_pdf <- rowMeans(intra_pdf)
#head(single_intra_pdf)

# plot to see
plot(x_axis, single_peri_pdf, type="l", col="blue", lwd=3,
     main = "Benign Template",
     xlab = "Pixel Intensity (Amplitude)",
     ylab = "Probability")
lines(x_axis, single_intra_pdf, col="red", lwd=3)

The blue curve is the peri tumoral region (tumor boundary) and the red curve represents the intra tumoral region (tumor center). From the gragh we see that the blue curve spikes around 0 (\(\mu <0.5\)). This characterizes Pre-Rayleig scattering - indicating regions with low scatterer density (dark) and irregular area at the tumor edge. The red curve peaks away from 0 (\(\mu >0.5\)), maybe near the value 500, indicates characteristic of Rayleigh or Post-Rayleigh scattering. This suggests the center of benign tumors is brighter and contains more regular patterns compared to the edges.

Thus the strong visual separation between Peri and Intra templates confirm that statistical “characters” of tissue varies by region. Therefore, these distinct Class 0 provides a well baseline for calculating the rejection rates of Class 1

#Find the 95% rejection threshold
# Function to calculate where the area under the curve hits 95%
get_cutoff <- function(x_vals, pdf_curve) {
  step_size <- x_vals[2] - x_vals[1]
  cumulative_area <- cumsum(pdf_curve * step_size)
  
  # normalize to ensure max is exactly 1.0
  cumulative_area <- cumulative_area / max(cumulative_area)
  
  # find the first X value where the area passes 0.95
  cutoff_index <- which(cumulative_area >= 0.95)[1]
  
  return(x_vals[cutoff_index])
}

# Calculate cutoffs for your generated templates
cutoff_peri <- get_cutoff(x_axis, single_peri_pdf)
cutoff_intra <- get_cutoff(x_axis, single_intra_pdf)

cat("Benign Cutoff Threshold (Peri):", round(cutoff_peri, 2), "\n")

Benign Cutoff Threshold (Peri): 4993.78

cat("Benign Cutoff Threshold (Intra):", round(cutoff_intra, 2), "\n")

Benign Cutoff Threshold (Intra): 1338.48

# Test the malignant cases (Class 1)

class1_data <- master %>% filter(Class == 1)

#Convert Omega (Power) to Amplitude
# our data is Omega (Intensity^2).
# we must take the Square Root of Omega to compare apples to apples.
malig_scores_peri <- sqrt(class1_data$Omega_Peri)
malig_scores_intra <- sqrt(class1_data$Omega_Intra)

# Check which malignant cases fall past the benign cutoff
# TRUE = Rejected (detected as anomaly), FALSE = Accepted (missed)
rejected_peri <- malig_scores_peri > cutoff_peri
rejected_intra <- malig_scores_intra > cutoff_intra

# Calculate the percentage
rate_peri <- mean(rejected_peri) * 100
rate_intra <- mean(rejected_intra) * 100


cat("FINAL NAKAGAMI REJECTION RATES:\n")

FINAL NAKAGAMI REJECTION RATES:

cat("Peri-Tumoral Detection Rate:", round(rate_peri, 2), "%\n")

Peri-Tumoral Detection Rate: 0.96 %

cat("Intra-Tumoral Detection Rate:", round(rate_intra, 2), "%\n")

Intra-Tumoral Detection Rate: 2.88 %

# visualize the cutoff lines
plot(x_axis, single_peri_pdf, type="l", col="blue", lwd=3,
     main = "Benign Template",
     xlab = "Pixel Intensity (Amplitude)",
     ylab = "Probability")
lines(x_axis, single_intra_pdf, col="red", lwd=3)
abline(v = cutoff_peri, col = "blue", lty = 2, lwd = 2)
abline(v = cutoff_intra, col = "red", lty = 2, lwd = 2)
legend("topright", 
       legend = c("Peri Template", "Intra Template", "95% Cutoff"), 
       col = c("blue", "red", "black"), 
       lty = c(1, 1, 2), lwd = 2)

From the rejection result, Nakagami model failed to distinguish between Benign and Malignant tumor: Peri Detection Rate: 0.96%, meaning uut of roughly 100 cancer cases, this test only caught 1. Intra Detection Rate: 2.88% - it caught roughly 3 out of 100 cancer cases. After all, both blue curve and red curve have wide long right tail - meaning benign and malignant tumors can be anything from 0 to a few thousands amplitude, provided poor discrimination between benign and malignant lesions.

Since the simple Nakagami model failed to catch the cancer, we can move on to Gamma and HK to see if they can capture the “clustering” of cancer cells better.

Gamma PDF:

\(X \sim Gamma(\alpha, \lambda)\)

\[ f(x)=\dfrac{x^{\alpha-1}e^{-\lambda x}\lambda^\alpha}{\Gamma(\alpha)}, \;\text{for}; x>0, \alpha, \lambda >0 \]

to be cont…

III. Group Split and Normalization.

To prevent data from accidentally “memorizing” characteristics of the features, we perform the split first before normalizing dataset.

Since this is a small dataset, we use K-Fold Cross-Validation. It is more reliable than training/testing.

set.seed(123)
lesion_folds <- group_vfold_cv(master, group = ID, v=10) # we group by ID
# 100 lesion, each with 2 scans = 200 rows,we put 10 lesions (20 rows) aside and then use 
# the remaining 180 rows to train, then it tests on 10 lesions above. 
#The process repeats 10 times and finally average the accuracy across all 10 tries.

# Normalizing

lesion_recipe <- recipe(Class ~ ., data = master) %>%
  update_role(ID, Scan, BIRADS, new_role = "metadata") %>% 
  step_normalize(all_predictors()) # features are not normalize yet, since I dont want it calculate mean, variance now and might "memorize" the global distribution.