Correct data type: `Class` , `ID` and `BIRADS` columns are currently numbers or characters, but for ML in R, they need to be factors(categorical variables).
There are 96 Benign case and 102 Malignant cases. Most of the benign cases ( Class 0) align with lower `BIRADS` and malignant cases (Class 1) align with higher `BIRADS` . However, there will be some challenge in BIRADS 4a and 4b. Most of 4a are benign but some are malignant and most of 4b are malignant but some are benign. This is a good place to use Kolmogorov-Smirnov Test to see whether or not their distributions (4a vs 4b) are distinct.
The anomaly here is there are two cases of benign that have BIRADS 5 ~ this is false positive. We can use Proportion of Rejections test to evaluate this further.
Next, we want to check normality for all the features for each class.
Looking at density plots above, we see that distributions that exhibit strong seperations between the Benign (Green) and Malignant (Red) classes are: distributions: U_Intra, U_Peri, X_Intra, X_Peri, Mu_Intra, Alpha_Intra. This visual gap provides the groundwork for the Kolmogorov-Smirnov (KS) test.
II. Test Statistics
1. Two-Sample Kolmogorov-Smirnov (KS) Test
Before we head to Machine Learning models, we will use the Two-Sample Kolmogorov-Smirnov (KS) Test.
Suppose we have two samples size \(m\) and size \(n\)The Two-Sample Kolmogorov-Smirnov (KS) Test use to test whether or not two samples came from the same distribution. The null hypothesis is \(H_0\): both samples (Class 0 and Class 1) come from a population with the same distribution. The test statistics is D - statistics which has values between 0 and 1. The higher the D the more separated “Green” and “Red” curves are. And ultimately if the p-value < 0.05, we reject \(H_0\) and say there is a statistically significant difference between Benign and Malignant for that feature.
# Create a summary table of KS results for all numeric featuresks_summary <- master %>%select(where(is.numeric)) %>%select(-any_of(c("Class", "ID", "Scan", "BIRADS", "Lesion"))) %>%map_df(~{ test <-ks.test(.x[master$Class ==0], .x[master$Class ==1])data.frame(D_statistic = test$statistic, p_value = test$p.value) }, .id ="Feature") %>%arrange(desc(D_statistic)) # put the most significant at the top print(ks_summary)
For those features that are statistically significant, we identify them as the most viable biomarkers for the classification task.
2. Class 0 Reference Distributions and Evaluation of Malignant Rejection Rates
The procedure:
For each feature (Nakagami, Gamma, etc.), we take all Benign scans and turn their parameters into curves (plug the parameters into their distribution’s pdf). Average those curves to create one single distribution.
Once we have the single distribution data we decide where “normal” ends and “non-normal” begins. And how do we do this? we marks the outer 5% of the curve as rejection region. If a lesion falls within this rejection region, it is statistically rejected because it doesn’t look like typical benign lesion.
Next, we test all 104 malignant observations. We look at where their values land on the x-axis relative to single distribution data.
If a malignant lands on that 5% rejection region, the test is a success - it “found” the cancer.
If it lands in the normal region, the test “missed” it.
Finally, we calculate the proportion of rejection. We want to know: What proportion of the malignant cases did the single distribution correctly find as being not benign.
This test will let us know which distribution (Nakagami, Gamma, and HK) is the better tool for this analysis.
#define Nakagami PDF:nakagami_pdf <-function(x, mu, omega){ term1 <- (2*mu^mu)/(gamma(mu)*omega^mu) term2 <- x^(2*mu-1) term3<-exp(-(mu/omega)*x^2)return(term1*term2*term3)}# filter class 0class0_data <- master %>%filter(Class==0) #head(class0_dat)#Nakagami table:# Since both mu features are not statistcally significant, we use the averageavg_mu_intra_0 <-mean(class0_data$Mu_Intra)avg_mu_peri_0 <-mean(class0_data$Mu_Peri)class0_nakagami <- class0_data %>%select( M= Mu_Peri, Omega = Omega_Peri)# Table for INTRA regionclass0_nakagami_intra <- class0_data %>%select(Omega = Omega_Intra) %>%mutate(Mu = avg_mu_intra_0)# Table for PERI regionclass0_nakagami_peri <- class0_data %>%select(Omega = Omega_Peri) %>%mutate(Mu = avg_mu_peri_0)# Define x-axis: (for pixel: 0=black, to 255 = white)# checking range of class0_nakagami_peri and class0_nakagami_intra values to get idea of x_axis rangesummary(class0_nakagami_peri$Mu)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3908 0.3908 0.3908 0.3908 0.3908 0.3908
summary(class0_nakagami_peri$Omega)# max is roughly 45 millions, Omega is sqrt(intensity) which is 6764.487
Min. 1st Qu. Median Mean 3rd Qu. Max.
921804 3537296 5716001 8075811 9036075 45758278
max_intensity <-7000x_axis <-seq(1, max_intensity, length.out=2000)#Periperi_pdf <-sapply(1:nrow(class0_nakagami_peri), function(i) {nakagami_pdf(x_axis, mu = class0_nakagami_peri$Mu[i], omega = class0_nakagami_peri$Omega[i])})#averaging to get a single persingle_peri_pdf <-rowMeans(peri_pdf)#head(single_peri_pdf)# Intraintra_pdf <-sapply(1:nrow(class0_nakagami_intra), function(i) {nakagami_pdf(x_axis, mu = class0_nakagami_intra$Mu[i], omega = class0_nakagami_intra$Omega[i])})single_intra_pdf <-rowMeans(intra_pdf)#head(single_intra_pdf)# plot to seeplot(x_axis, single_peri_pdf, type="l", col="blue", lwd=3,main ="Benign Template",xlab ="Pixel Intensity (Amplitude)",ylab ="Probability")lines(x_axis, single_intra_pdf, col="red", lwd=3)
The blue curve is the peri tumoral region (tumor boundary) and the red curve represents the intra tumoral region (tumor center). From the gragh we see that the blue curve spikes around 0 (\(\mu <0.5\)). This characterizes Pre-Rayleig scattering - indicating regions with low scatterer density (dark) and irregular area at the tumor edge. The red curve peaks away from 0 (\(\mu >0.5\)), maybe near the value 500, indicates characteristic of Rayleigh or Post-Rayleigh scattering. This suggests the center of benign tumors is brighter and contains more regular patterns compared to the edges.
Thus the strong visual separation between Peri and Intra templates confirm that statistical “characters” of tissue varies by region. Therefore, these distinct Class 0 provides a well baseline for calculating the rejection rates of Class 1
#Find the 95% rejection threshold# Function to calculate where the area under the curve hits 95%get_cutoff <-function(x_vals, pdf_curve) { step_size <- x_vals[2] - x_vals[1] cumulative_area <-cumsum(pdf_curve * step_size)# normalize to ensure max is exactly 1.0 cumulative_area <- cumulative_area /max(cumulative_area)# find the first X value where the area passes 0.95 cutoff_index <-which(cumulative_area >=0.95)[1]return(x_vals[cutoff_index])}# Calculate cutoffs for your generated templatescutoff_peri <-get_cutoff(x_axis, single_peri_pdf)cutoff_intra <-get_cutoff(x_axis, single_intra_pdf)cat("Benign Cutoff Threshold (Peri):", round(cutoff_peri, 2), "\n")
# Test the malignant cases (Class 1)class1_data <- master %>%filter(Class ==1)#Convert Omega (Power) to Amplitude# our data is Omega (Intensity^2).# we must take the Square Root of Omega to compare apples to apples.malig_scores_peri <-sqrt(class1_data$Omega_Peri)malig_scores_intra <-sqrt(class1_data$Omega_Intra)# Check which malignant cases fall past the benign cutoff# TRUE = Rejected (detected as anomaly), FALSE = Accepted (missed)rejected_peri <- malig_scores_peri > cutoff_perirejected_intra <- malig_scores_intra > cutoff_intra# Calculate the percentagerate_peri <-mean(rejected_peri) *100rate_intra <-mean(rejected_intra) *100cat("FINAL NAKAGAMI REJECTION RATES:\n")
From the rejection result, Nakagami model failed to distinguish between Benign and Malignant tumor: Peri Detection Rate: 0.96%, meaning uut of roughly 100 cancer cases, this test only caught 1. Intra Detection Rate: 2.88% - it caught roughly 3 out of 100 cancer cases. After all, both blue curve and red curve have wide long right tail - meaning benign and malignant tumors can be anything from 0 to a few thousands amplitude, provided poor discrimination between benign and malignant lesions.
Since the simple Nakagami model failed to catch the cancer, we can move on to Gamma and HK to see if they can capture the “clustering” of cancer cells better.
To prevent data from accidentally “memorizing” characteristics of the features, we perform the split first before normalizing dataset.
Since this is a small dataset, we use K-Fold Cross-Validation. It is more reliable than training/testing.
set.seed(123)lesion_folds <-group_vfold_cv(master, group = ID, v=10) # we group by ID# 100 lesion, each with 2 scans = 200 rows,we put 10 lesions (20 rows) aside and then use # the remaining 180 rows to train, then it tests on 10 lesions above. #The process repeats 10 times and finally average the accuracy across all 10 tries.# Normalizinglesion_recipe <-recipe(Class ~ ., data = master) %>%update_role(ID, Scan, BIRADS, new_role ="metadata") %>%step_normalize(all_predictors()) # features are not normalize yet, since I dont want it calculate mean, variance now and might "memorize" the global distribution.