Correct data type: `Class` , `ID` and `BIRADS` columns are currently numbers or characters, but for ML in R, they need to be factors(categorical variables).
There are 96 Benign case and 102 Malignant cases. Most of the benign cases ( Class 0) align with lower `BIRADS` and malignant cases (Class 1) align with higher `BIRADS` . However, there will be some challenge in BIRADS 4a and 4b. Most of 4a are benign but some are malignant and most of 4b are malignant but some are benign. This is a good place to use Kolmogorov-Smirnov Test to see whether or not their distributions (4a vs 4b) are distinct.
The anomaly here is there are two cases of benign that have BIRADS 5 ~ this is false positive. We can use Proportion of Rejections test to evaluate this further.
Next, we want to check normality for all the features for each class.
Looking at density plots above, we see that distributions that exhibit strong seperations between the Benign (Green) and Malignant (Red) classes are: distributions: U_Intra, U_Peri, X_Intra, X_Peri, Mu_Intra, Alpha_Intra. This visual gap provides the groundwork for the Kolmogorov-Smirnov (KS) test.
II. Test Statistics
1. Two-Sample Kolmogorov-Smirnov (KS) Test
Before we head to Machine Learning models, we will use the Two-Sample Kolmogorov-Smirnov (KS) Test.
Suppose we have two samples size \(m\) and size \(n\)The Two-Sample Kolmogorov-Smirnov (KS) Test use to test whether or not two samples came from the same distribution. The null hypothesis is \(H_0\): both samples (Class 0 and Class 1) come from a population with the same distribution. The test statistics is D - statistics which has values between 0 and 1. The higher the D the more separated “Green” and “Red” curves are. And ultimately if the p-value < 0.05, we reject \(H_0\) and say there is a statistically significant difference between Benign and Malignant for that feature.
# Create a summary table of KS results for all numeric featuresks_summary <- master %>%select(where(is.numeric)) %>%select(-any_of(c("Class", "ID", "Scan", "BIRADS", "Lesion"))) %>%map_df(~{ test <-ks.test(.x[master$Class ==0], .x[master$Class ==1])data.frame(D_statistic = test$statistic, p_value = test$p.value) }, .id ="Feature") %>%arrange(desc(D_statistic)) # put the most significant at the top print(ks_summary)
For those features that are statistically significant, we identify them as the most viable biomarkers for the classification task.
2. Class 0 Reference Distributions and Evaluation of Malignant Rejection Rates
The procedure:
For each feature (Nakagami, Gamma, etc.), we take all Benign scans and turn their parameters into curves (plug the parameters into their distribution’s pdf). Average those curves to create one single distribution.
Once we have the single distribution data we decide where “normal” ends and “non-normal” begins. And how do we do this? we marks the outer 5% of the curve as rejection region. If a lesion falls within this rejection region, it is statistically rejected because it doesn’t look like typical benign lesion.
Next, we test all 104 malignant observations. We look at where their values land on the x-axis relative to single distribution data.
If a malignant lands on that 5% rejection region, the test is a success - it “found” the cancer.
If it lands in the normal region, the test “missed” it.
Finally, we calculate the proportion of rejection. We want to know: What proportion of the malignant cases did the single distribution correctly find as being not benign.
This test will let us know which distribution (Nakagami, Gamma, and HK) is the better tool for this analysis.
#define Nakagami PDF:nakagami_pdf <-function(x, mu, omega){ term1 <- (2*mu^mu)/(gamma(mu)*omega^mu) term2 <- x^(2*mu-1) term3<-exp(-(mu/omega)*x^2)return(term1*term2*term3)}# filter class 0class0_data <- master %>%filter(Class==0) #head(class0_dat)#Nakagami table:# Since both mu features are not statistcally significant, we use the averageavg_mu_intra_0 <-mean(class0_data$Mu_Intra)avg_mu_peri_0 <-mean(class0_data$Mu_Peri)class0_nakagami <- class0_data %>%select( M= Mu_Peri, Omega = Omega_Peri)# Table for INTRA regionclass0_nakagami_intra <- class0_data %>%select(Omega = Omega_Intra) %>%mutate(Mu = avg_mu_intra_0)# Table for PERI regionclass0_nakagami_peri <- class0_data %>%select(Omega = Omega_Peri) %>%mutate(Mu = avg_mu_peri_0)# Define x-axis: (for pixel: 0=black, to 255 = white)# checking range of class0_nakagami_peri and class0_nakagami_intra values to get idea of x_axis rangesummary(class0_nakagami_peri$Mu)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3908 0.3908 0.3908 0.3908 0.3908 0.3908
summary(class0_nakagami_peri$Omega)# max is roughly 45 millions, Omega is sqrt(intensity) which is 6764.487
Min. 1st Qu. Median Mean 3rd Qu. Max.
921804 3537296 5716001 8075811 9036075 45758278
max_intensity <-7000x_axis <-seq(1, max_intensity, length.out=2000)#Periperi_pdf <-sapply(1:nrow(class0_nakagami_peri), function(i) {nakagami_pdf(x_axis, mu = class0_nakagami_peri$Mu[i], omega = class0_nakagami_peri$Omega[i])})#averaging to get a single persingle_peri_pdf <-rowMeans(peri_pdf)#head(single_peri_pdf)# Intraintra_pdf <-sapply(1:nrow(class0_nakagami_intra), function(i) {nakagami_pdf(x_axis, mu = class0_nakagami_intra$Mu[i], omega = class0_nakagami_intra$Omega[i])})single_intra_pdf <-rowMeans(intra_pdf)#head(single_intra_pdf)# plot to seeplot(x_axis, single_peri_pdf, type="l", col="blue", lwd=3,main ="Benign Template",xlab ="Pixel Intensity (Amplitude)",ylab ="Probability")lines(x_axis, single_intra_pdf, col="red", lwd=3)
The blue curve is the peri tumoral region (tumor boundary) and the red curve represents the intra tumoral region (tumor center). From the gragh we see that the blue curve spikes around 0 (\(\mu <0.5\)). This characterizes Pre-Rayleig scattering - indicating regions with low scatterer density (dark) and irregular area at the tumor edge. The red curve peaks away from 0 (\(\mu >0.5\)), maybe near the value 500, indicates characteristic of Rayleigh or Post-Rayleigh scattering. This suggests the center of benign tumors is brighter and contains more regular patterns compared to the edges.
Thus the strong visual separation between Peri and Intra templates confirm that statistical “characters” of tissue varies by region. Therefore, these distinct Class 0 provides a well baseline for calculating the rejection rates of Class 1
#Find the 95% rejection threshold# Function to calculate where the area under the curve hits 95%get_cutoff <-function(x_vals, pdf_curve) { step_size <- x_vals[2] - x_vals[1] cumulative_area <-cumsum(pdf_curve * step_size)# normalize to ensure max is exactly 1.0 cumulative_area <- cumulative_area /max(cumulative_area)# find the first X value where the area passes 0.95 cutoff_index <-which(cumulative_area >=0.95)[1]return(x_vals[cutoff_index])}# Calculate cutoffs for your generated templatescutoff_peri <-get_cutoff(x_axis, single_peri_pdf)cutoff_intra <-get_cutoff(x_axis, single_intra_pdf)cat("Benign Cutoff Threshold (Peri):", round(cutoff_peri, 2), "\n")
# Test the malignant cases (Class 1)class1_data <- master %>%filter(Class ==1)#Convert Omega (Power) to Amplitude# our data is Omega (Intensity^2).# we must take the Square Root of Omega to compare apples to apples.malig_scores_peri <-sqrt(class1_data$Omega_Peri)malig_scores_intra <-sqrt(class1_data$Omega_Intra)# Check which malignant cases fall past the benign cutoff# TRUE = Rejected (detected as anomaly), FALSE = Accepted (missed)rejected_peri <- malig_scores_peri > cutoff_perirejected_intra <- malig_scores_intra > cutoff_intra# Calculate the percentagerate_peri <-mean(rejected_peri) *100rate_intra <-mean(rejected_intra) *100cat("FINAL NAKAGAMI REJECTION RATES:\n")
# visualize the cutoff lineslibrary(ggplot2)plot_data <-data.frame(Amplitude =c(x_axis, x_axis),Density =c(single_peri_pdf, single_intra_pdf),Group =rep(c("Peri Template", "Intra Template"), each =length(x_axis)))ggplot(plot_data, aes(x = Amplitude, y = Density, fill = Group, color = Group)) +geom_area(alpha =0.45, position ="identity") +geom_line(linewidth =1) +geom_vline(xintercept = cutoff_peri, color ="blue", linetype =2, linewidth =1) +geom_vline(xintercept = cutoff_intra, color ="red", linetype =2, linewidth =1) +labs(title ="Nakagami Parameter-Based Rejection Test",x ="Pixel Intensity (Amplitude)",y ="Probability" ) +scale_color_discrete(name =NULL) +scale_fill_discrete(name =NULL) +theme_minimal()
From the rejection result, Nakagami model failed to distinguish between Benign and Malignant tumor: Peri Detection Rate: 0.96%, meaning uut of roughly 100 cancer cases, this test only caught 1. Intra Detection Rate: 2.88% - it caught roughly 3 out of 100 cancer cases. After all, both blue curve and red curve have wide long right tail - meaning benign and malignant tumors can be anything from 0 to a few thousands amplitude, provided poor discrimination between benign and malignant lesions.
Since the simple Nakagami model failed to catch the cancer, we can move on to Gamma and HK to see if they can capture the “clustering” of cancer cells better.
Numerical result explains “vertical line” on the left of the plot - benign data hurtles so close to 0 (darkness) on the scale of 46 millions.
Peri tumoral detection is 0%. Model fails to tell them apart.
Intra tumoral detection is 0.96%. Same thing: model fails to tell them apart.
So far, we tested Amplitude (Nakagami) and found <1% detection rate. We tested Intensity (Gamma) and found 0% detection rate. This implies cancer can not be detected using brihghtness alone. Next we use Homodyned - K distribution which use clusteral structuring
Homodyded - K
Unlike the Gamma or Nakagami models, the HK distribution does not have a simple “closed-form” equation. Instead, it is defined by an integral of Bessel functions that must be solved numerically.
\[
P(x)=\int_0^\infty u.x.J_0(ux). J_0(u\epsilon).\Big(1+\dfrac{u^2\sigma^2}{2 \alpha}\Big)^{-\alpha} du
\] where:
\(J_0\) is the Bessel function of first order.
\(\epsilon\) is the coherent signal (how steady the structure is)
\(x\): signal amplitude (this is what we measure)
\(\sigma^2\): diffused signal (background scattering signal)
\(\alpha\): clustering pararmeter (low \(\alpha=\) squished/spiked, high \(\alpha=\) scattering/uniform)
1. Extract the parameter \(\alpha\), \(\sigma\) and \(\epsilon\) from \(X\) and \(U\) ~ these are statistic, they are only the summary number describing data.
# define HK functionextract_hk_unqiue <-function(X, U) {# W = Log-Variance W_val <- U - X^2 W_val[W_val <0.01] <-0.01# ALPHA HK (Scatterer Density) alpha_hk <-1/ W_val alpha_hk <-pmax(0.1, pmin(20, alpha_hk))# SIGMA HK (Scale / Diffuse Power)# Scaled by 1,000,000 for numeric stability sigma_sq_scaled <- (exp(X +0.57721) /2) /1e6 sigma_hk <-sqrt(sigma_sq_scaled)# EPSILON HK (Coherent Strength) epsilon_hk <- sigma_hk *0.1return(data.frame(AlphaHK = alpha_hk, SigmaHK = sigma_hk, EpsilonHK = epsilon_hk))}# apply to tumor periperi_hk <-extract_hk_unqiue(master$X_Peri, master$U_Peri)master$AlphaHK_Peri <- peri_hk$AlphaHKmaster$SigmaHK_Peri <- peri_hk$SigmaHKmaster$EpsilonHK_Peri <- peri_hk$EpsilonHK# apply to intra periintra_hk <-extract_hk_unqiue(master$X_Intra, master$U_Intra)master$AlphaHK_Intra <- intra_hk$AlphaHKmaster$SigmaHK_Intra <- intra_hk$SigmaHKmaster$EpsilonHK_Intra <- intra_hk$EpsilonHKhead(master)
# Formula: Amplitude = sqrt(2 * Sigma^2 + Epsilon^2)# benign amplitudesbenign_data <- master[master$Class ==0, ]benign_amps <-sqrt(2* (benign_data$SigmaHK_Peri^2) + (benign_data$EpsilonHK_Peri^2))# malignant amplitudesmalignant_data <- master[master$Class ==1, ]malignant_amps <-sqrt(2* (malignant_data$SigmaHK_Peri^2) + (malignant_data$EpsilonHK_Peri^2))# 95% rejectionrejection_threshold <-quantile(benign_amps, 0.95, na.rm =TRUE)# How many malignant cases fall into that top 5% "Rejection Region"?number_found <-sum(malignant_amps > rejection_threshold, na.rm =TRUE)proportion_rejection <- (number_found /nrow(malignant_data)) *100#plotlibrary(ggplot2)plot_data <-data.frame(Amplitude =c(benign_amps, malignant_amps),Group =c(rep("Benign", length(benign_amps)), rep("Malignant", length(malignant_amps))))ggplot(plot_data, aes(x = Amplitude, fill = Group)) +geom_density(alpha =0.5) +geom_vline(xintercept = rejection_threshold, color ="red", linetype ="dashed", linewidth =1) +labs(title ="HK Parameter-Based Rejection Analysis",subtitle =paste("Rejection Proportion:", round(proportion_rejection, 2), "%"),x ="Calculated HK Amplitude", y ="Density") +theme_minimal()
Not much improve compare to Nakagami or Gamma either.
I tested all three physical models (Gamma, Nakagami, HK) using standard 95% rejection regions. All three yielded detection rates below 5%. This is because the malignant tumors in this dataset exhibit lower variance and amplitude than benign cases, causing the ‘upper-tail’ test to miss them. Therefore, a multi-variable Machine Learning approach is required to capture these inverted patterns since applying one parameter at a time may not capture cancer proportion well.
3. KS Bivariate Test
For a given feature (say U_Peri), each lesion should contribute one 2D observation: \(\text{U_Peri}_{rf1},\text{U_Peri}_{rf2}\).
So first we have to reshape from “long” (two rows per lesion) to “wide” (one row per lesion with two columns). Instead of looking at just one value \(X\) (like we did before with KS test), we look at pair \[X_{rf1},X_{rf2}\]. So every lesion becomes a point in 2D space.
Then we run a multivariate test statistics to produce p-values.
Interpretation:
Small p-value \(\rightarrow\) benign vs malignant look different in the 2D space.
Large p-value \(\rightarrow\) benign vs malignant have similar distributions, the feature probably cannot distinguish cancer.
We will use example plots to visualize this difference (or similarity)
# ============================================================# BIVARIATE (rf1, rf2) TWO-SAMPLE TEST PER FEATURE (per lesion)# Uses Lesion as the pairing key and energy distance test# ============================================================library(dplyr)library(tidyr)library(purrr)library(ggplot2)library(energy)# 0) Quick checks#------------------------------------------------------# Confirm scans are rf1/rf2unique(master$Scan)
[1] "rf1" "rf2"
# Confirm most lesions have 2 rows (rf1 + rf2); identify any odd oneslesion_row_counts <- master %>%count(Lesion)lesion_row_counts %>%count(n) #ok
Storing counts in `nn`, as `n` already present in input
ℹ Use `name = "new_name"` to pick a new name.
# A tibble: 1 × 2
n nn
<int> <int>
1 2 100
# 1) Pair rf1/rf2 per lesion for a given feature (wide format)# ----------------------------------------------------pair_feature <-function(df, feature, key ="Lesion") { df %>%distinct() %>%select(all_of(key), Class, Scan, all_of(feature)) %>%mutate(Scan =as.character(Scan)) %>%group_by(across(all_of(c(key, "Scan")))) %>%slice(1) %>%ungroup() %>%pivot_wider(names_from = Scan,values_from =all_of(feature),names_prefix =paste0(feature, "_") ) %>%filter(!is.na(.data[[paste0(feature, "_rf1")]]),!is.na(.data[[paste0(feature, "_rf2")]]))}# 2) Run a bivariate two-sample test for one feature# ------------------------------------------------------run_bivar_test <-function(df, feature, key ="Lesion", R_perm =999) { paired <-pair_feature(df, feature, key = key) X0 <- paired %>%filter(Class ==0) %>%select(ends_with("_rf1"), ends_with("_rf2")) %>%as.matrix() X1 <- paired %>%filter(Class ==1) %>%select(ends_with("_rf1"), ends_with("_rf2")) %>%as.matrix()# if something goes wrong (like too few cases), guardif (nrow(X0) <5||nrow(X1) <5) {return(tibble(Feature = feature, n0 =nrow(X0), n1 =nrow(X1),statistic =NA_real_, p_value =NA_real_,note ="Too few paired lesions in one class" )) }# Energy distance two-sample test (permutation p-value) X <-rbind(X0, X1) et <- energy::eqdist.etest(X, sizes =c(nrow(X0), nrow(X1)), R = R_perm)tibble(Feature = feature,n0 =nrow(X0),n1 =nrow(X1),statistic =unname(et$statistic),p_value =unname(et$p.value),note ="" )}# 3) Choose your feature list# ---------------------------features_to_test <-c(# physics / distribution parameters"Mu_Intra", "Mu_Peri","Omega_Intra", "Omega_Peri","Alpha_Intra", "Alpha_Peri","Theta_Intra", "Theta_Peri","X_Intra", "X_Peri","U_Intra", "U_Peri",# texture features"Contrast_Intra", "Contrast_Peri","Correlation_Intra", "Correlation_Peri","Energy_Intra", "Energy_Peri","Homogeneity_Intra", "Homogeneity_Peri",# HK features"AlphaHK_Intra", "AlphaHK_Peri","SigmaHK_Intra", "SigmaHK_Peri","EpsilonHK_Intra", "EpsilonHK_Peri")# Keep only features that actually exist in masterfeatures_to_test <- features_to_test[features_to_test %in%names(master)]# 4) Run all tests + summarize# --------------------------------------------set.seed(123)bivar_results <-map_dfr(features_to_test, ~run_bivar_test(master, .x, key ="Lesion", R_perm =999)) %>%arrange(p_value)bivar_results
From the table, strongest features are: Omega_Peri, Theta_Peri, X_Intra, X_Peri, U_Intra, U_Peri, `SigmaHK_Peri. All have p-values <= 0.001. This strongly suggest that benign and malignant have different joint distribution for these features.
Correlation_Intra, AlphaHK_Peri, AlphaHK_Peri, Omega_Intra, AlphaHK_Intra are still statistically significant, but weaker.
These features exhibit strong discriminatory potential and were considered candidates for subsequent modeling.
Next, we take a look at the scatter plot of some of the features reveal whether the bivariate separation exists
# 5) Quick bivariate plot for any one feature# ---------------------------plot_bivariate_feature <-function(df, feature, key ="Lesion") { paired <-pair_feature(df, feature, key = key) x1 <-paste0(feature, "_rf1") x2 <-paste0(feature, "_rf2")ggplot(paired, aes(x = .data[[x1]], y = .data[[x2]], color =factor(Class))) +geom_point(alpha =0.8) +labs(title =paste0("Bivariate view: ", feature, " (rf1 vs rf2)"),x =paste0(feature, " rf1"),y =paste0(feature, " rf2"),color ="Class" ) +theme_minimal()}# Example:plot_bivariate_feature(master, "U_Peri")
plot_bivariate_feature(master, "Mu_Intra")
U_Peri (p-value: 0.001) is statistically significant. The scatter plot shows somewhat two separate regions between benign and malignant.
Mu_Intra (p-value: 0.327) is not statistically significant. The two classes, benign and malignant, are overlap everywhere.
III. Predictive Models
1. Group Split and Normalization.
To prevent data from accidentally “memorizing” characteristics of the features, we perform the split first before normalizing dataset.
Since this is a small dataset, we use K-Fold Cross-Validation. It is more reliable than training/testing.
Why we use grouped K-fold CV?
Each lesion has repeated observations, such as rf1 and rf2. These are related. If one scan goes into the training set and the other goes into the test set, the model may get an unfair advantage because it has already seen something extremely similar. Thus result in data leakage.
set.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v=10) # create the fold. #grouped CV prevent data leakage by forcing all rows from the same lesion to stay in the same fold.#we group by ID# 100 lesion, each with 2 scans = 200 rows,we put 10 lesions (20 rows) aside and then use # the remaining 90 lesions (180 rows) to train, then it tests on 10 lesions above. #The process repeats 10 times and finally average the accuracy across all 10 tries.# Normalizinglesion_recipe <-recipe(Class ~ ., data = master) %>%update_role(ID, Scan, BIRADS, new_role ="ID") %>%step_zv(all_predictors()) %>%step_normalize(all_numeric_predictors()) # only numeric variables are normalized; features are not normalize yet, since I dont want it calculate mean, variance now and might "memorize" the global distribution. #recipe() and prep(): estimated on training fold only#applied to test fold
Normalization parameters are estimated within each training fold during cross-validation to avoid information leakage from the test data.
Here is what happens within cross validation process
For fold 1 (1 fold = 10 lesions = 20 rows):
use 9 folds as training data
use 1 fold as test data
estimate normalization values from the training part only
apply those training-based values to both train and test
fit the model on the training data
predict on the test fold
record metrics such as accuracy, ROC AUC, sensitivity, specificity
Then repeat for fold 2, fold 3, …, fold 10.
2. Logistics Regression
# ===============================================# LOGISTIC REGRESSION# ==============================================library(tidyverse)library(tidymodels)# 1. Reload the data freshmaster <-read_csv("Master_Feature_Table.csv")
Rows: 200 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ID, Scan, BIRADS
dbl (22): Lesion, Class, Mu_Intra, Omega_Intra, Mu_Peri, Omega_Peri, Alpha_I...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2. Correct data types# Make Malignant the FIRST level so sens/spec use Malignant as the positive classmaster <- master %>%mutate(Lesion =factor(Lesion),ID =factor(ID),Scan =factor(Scan),BIRADS =factor(BIRADS),Class =factor(as.character(Class),levels =c("1", "0"),labels =c("Malignant", "Benign")) )# Check outcome codinglevels(master$Class)
[1] "Malignant" "Benign"
table(master$Class, useNA ="ifany")
Malignant Benign
104 96
# 3. Grouped 10-fold cross-validation# Keep rf1 and rf2 from the same lesion in the same foldset.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v =10)# 4. Preprocessing recipe# Exclude Lesion, ID, Scan, and BIRADS from predictors# Remove zero-variance predictors# Normalize numeric predictors inside each training fold onlylesion_recipe <-recipe(Class ~ ., data = master) %>%update_role(Lesion, ID, Scan, BIRADS, new_role ="ID") %>%step_zv(all_predictors()) %>%step_normalize(all_numeric_predictors())# 5. Define logistic regression modellog_spec <-logistic_reg() %>%set_engine("glm") %>%set_mode("classification")# 6. Combine recipe + model into a workflowlog_wf <-workflow() %>%add_recipe(lesion_recipe) %>%add_model(log_spec)# 7. Performance metrics# Since Malignant is the first level, sens/spec are for Malignantclass_metrics <-metric_set(roc_auc, accuracy, sens, spec)# 8. Run grouped 10-fold cross-validationset.seed(123)log_res <-fit_resamples( log_wf,resamples = lesion_folds,metrics = class_metrics,control =control_resamples(save_pred =TRUE))# 9. Average performance across the 10 foldscollect_metrics(log_res)
Accuracy = 0.670. About 67.0% of lesions were classified correctly on average across the 10 folds.
ROC AUC = 0.727. This is the best overall summary here. An AUC of about 0.73 means the model has moderate discrimination. It is clearly better than random guessing, but not strong.
Sensitivity = 0.664. Since Malignant is now the positive class, this means the model correctly detects about 66.4% of malignant lesions.
Specificity = 0.700. The model correctly identifies about 70.0% of benign lesions.
So the model is reasonably balanced, but it still misses about one-third of malignant cases, which is important in a cancer-classification setting.
3. Support Vector Machine (SVM)
Data likely have non-linear boundaries and SVM also often works well with small/medium dataset.
# ============================================================# SUPPORT VECTOR MACHINE (SVM) WITH GROUPED 10-FOLD CV# ============================================================library(tidyverse)library(tidymodels)# 1. Reload the data freshmaster <-read_csv("Master_Feature_Table.csv", show_col_types =FALSE)# 2. Correct data types# Make Malignant the positive class (first level)master <- master %>%mutate(Lesion =factor(Lesion),ID =factor(ID),Scan =factor(Scan),BIRADS =factor(BIRADS),Class =factor(as.character(Class),levels =c("1", "0"),labels =c("Malignant", "Benign")) )# Check codinglevels(master$Class)
[1] "Malignant" "Benign"
table(master$Class, useNA ="ifany")
Malignant Benign
104 96
# 3. Grouped 10-fold cross-validation# Keeps rf1 and rf2 from the same lesion in the same foldset.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v =10)# 4. Preprocessing recipe# Exclude Lesion, ID, Scan, and BIRADS from predictors# Remove zero-variance predictors# Normalize numeric predictorslesion_recipe <-recipe(Class ~ ., data = master) %>%update_role(Lesion, ID, Scan, BIRADS, new_role ="ID") %>%step_zv(all_predictors()) %>%step_normalize(all_numeric_predictors())# 5. Define SVM model# We tune both cost and rbf_sigmasvm_spec <-svm_rbf(cost =tune(),rbf_sigma =tune()) %>%set_engine("kernlab") %>%set_mode("classification")# 6. Combine recipe + model into a workflowsvm_wf <-workflow() %>%add_recipe(lesion_recipe) %>%add_model(svm_spec)# 7. Performance metricsclass_metrics <-metric_set(roc_auc, accuracy, sens, spec)# 8. Create a tuning grid# You can increase levels later if you want a larger searchsvm_grid <-grid_regular(cost(),rbf_sigma(),levels =5)svm_grid
# 9. Tune the SVM using grouped 10-fold CVset.seed(123)svm_tune_res <-tune_grid( svm_wf,resamples = lesion_folds,grid = svm_grid,metrics = class_metrics,control =control_grid(save_pred =TRUE))
maximum number of iterations reached 1.93728e-05 1.937278e-05maximum number of iterations reached 3.447101e-05 3.447096e-05maximum number of iterations reached 0.0004446822 0.0004446709maximum number of iterations reached 0.003473281 0.003471796maximum number of iterations reached 5.490815e-05 5.490803e-05maximum number of iterations reached 0.0006631877 0.0006631417maximum number of iterations reached 0.00356196 0.003576482maximum number of iterations reached 0.003333503 0.003334978maximum number of iterations reached 0.0004911366 0.0004908758maximum number of iterations reached 0.00167486 0.001678944maximum number of iterations reached 2.542258e-05 2.542255e-05maximum number of iterations reached 3.65625e-05 3.656239e-05maximum number of iterations reached 0.0004558288 0.0004558093maximum number of iterations reached 0.004035263 0.004035291maximum number of iterations reached 6.197289e-05 6.197259e-05maximum number of iterations reached 0.0007795895 0.0007795332maximum number of iterations reached 0.004197448 0.004198777maximum number of iterations reached 0.001965818 0.001967128maximum number of iterations reached 0.0004750443 0.0004747852maximum number of iterations reached 0.001636261 0.001638206maximum number of iterations reached 2.395927e-05 2.395923e-05maximum number of iterations reached 3.469817e-05 3.469804e-05maximum number of iterations reached 0.000442619 0.000442605maximum number of iterations reached 0.003673404 0.003669484maximum number of iterations reached 6.021588e-05 6.021564e-05maximum number of iterations reached 0.0007415351 0.0007414804maximum number of iterations reached 0.004347108 0.004347015maximum number of iterations reached 0.001566566 0.001569173maximum number of iterations reached 0.000414834 0.00041461maximum number of iterations reached 0.0009182723 0.0009116822maximum number of iterations reached 4.39388e-05 -5.695755e-11maximum number of iterations reached 7.628034e-05 4.950595e-11maximum number of iterations reached 0.001021588 7.685148e-09maximum number of iterations reached 0.005310394 5.246079e-06maximum number of iterations reached 0.0001169899 6.948964e-11maximum number of iterations reached 0.001824876 3.253998e-08maximum number of iterations reached 0.001631367 3.833234e-06maximum number of iterations reached 0.0002743618 3.663407e-08maximum number of iterations reached 0.0007189497 4.894897e-08maximum number of iterations reached 8.393136e-05 1.491958e-07maximum number of iterations reached 1.963565e-05 1.963562e-05maximum number of iterations reached 3.195017e-05 3.195009e-05maximum number of iterations reached 0.0004761357 0.0004761195maximum number of iterations reached 0.004616132 0.004614161maximum number of iterations reached 6.414473e-05 6.414441e-05maximum number of iterations reached 0.0007918825 0.0007918196maximum number of iterations reached 0.00432163 0.004332861maximum number of iterations reached 0.001981041 0.001983023maximum number of iterations reached 0.0006215059 0.0006211636maximum number of iterations reached 0.001763227 0.001761619maximum number of iterations reached 2.521045e-05 2.521042e-05maximum number of iterations reached 3.565756e-05 3.565745e-05maximum number of iterations reached 0.0004690722 0.0004690529maximum number of iterations reached 0.00447277 0.004466601maximum number of iterations reached 6.083459e-05 6.083427e-05maximum number of iterations reached 0.0008359464 0.0008358743maximum number of iterations reached 0.002430696 0.002444047maximum number of iterations reached 0.002123495 0.002123861maximum number of iterations reached 0.0005902443 0.0005899266maximum number of iterations reached 0.001873214 0.001850608maximum number of iterations reached 1.934658e-05 1.934655e-05maximum number of iterations reached 2.495458e-05 2.495455e-05maximum number of iterations reached 0.000381903 0.0003819024maximum number of iterations reached 0.003978042 0.003973916maximum number of iterations reached 4.519748e-05 4.519744e-05maximum number of iterations reached 0.0006037736 0.0006037594maximum number of iterations reached 0.004686471 0.004677882maximum number of iterations reached 0.003346499 0.003356669maximum number of iterations reached 0.0001654731 0.0001653866maximum number of iterations reached 0.0008264732 0.000818263maximum number of iterations reached 2.34685e-05 2.346847e-05maximum number of iterations reached 3.915876e-05 3.915859e-05maximum number of iterations reached 0.0004475298 0.0004474984maximum number of iterations reached 0.004380399 0.004377836maximum number of iterations reached 6.411736e-05 6.411681e-05maximum number of iterations reached 0.0008693925 0.0008692883maximum number of iterations reached 0.002609221 0.002612473maximum number of iterations reached 0.001217698 0.001213602maximum number of iterations reached 0.0004316407 0.0004314013maximum number of iterations reached 0.001522571 0.001528107maximum number of iterations reached 2.225618e-05 2.225615e-05maximum number of iterations reached 2.970324e-05 2.970315e-05maximum number of iterations reached 0.0003693966 0.0003693788maximum number of iterations reached 0.003097201 0.003097655maximum number of iterations reached 5.621161e-05 5.621128e-05maximum number of iterations reached 0.0007840748 0.0007840394maximum number of iterations reached 0.004550906 0.004551954maximum number of iterations reached 0.002394355 0.002390698maximum number of iterations reached 0.0003991924 0.0003989734maximum number of iterations reached 0.003063472 0.003079001maximum number of iterations reached 1.778816e-05 1.778813e-05maximum number of iterations reached 2.007336e-05 2.007328e-05maximum number of iterations reached 0.0003925202 0.0003924999maximum number of iterations reached 0.004263793 0.004261321maximum number of iterations reached 5.128423e-05 5.128388e-05maximum number of iterations reached 0.0006485365 0.000648482maximum number of iterations reached 0.002473645 0.002476248maximum number of iterations reached 0.001303802 0.001303905maximum number of iterations reached 0.0001480462 0.0001479634maximum number of iterations reached 0.00292133 0.002907439
# 10. View all tuning resultscollect_metrics(svm_tune_res)
# 13. Get the cross-validated metrics for that best SVMbest_svm_metrics <-collect_metrics(svm_tune_res) %>%inner_join(best_svm, by =c("cost", "rbf_sigma"))best_svm_metrics
So the SVM is doing a better job overall at separating benign from malignant lesions, especially in terms of ROC AUC, which is usually the most informative summary metric here. It is also better at correctly identifying benign lesions, though it gives up a little sensitivity for malignant detection.
Compared with logistic regression: Accuracy improved from 0.670 to 0.690 ROC AUC improved from 0.727 to 0.791 Sensitivity dropped slightly from 0.664 to 0.641 Specificity improved from 0.700 to 0.753 So the SVM is doing a better job overall at separating benign from malignant lesions, especially in terms of ROC AUC, which is usually the most informative summary metric here. It is also better at correctly identifying benign lesions, though it gives up a little sensitivity for malignant detection.
Thus, Logistic regression is the good baseline but SVM is the better overall model so far.
4. Decision Tree
# ===============================================# DECISION TREE WITH GROUPED 10-FOLD CV# ===============================================library(tidyverse)library(tidymodels)library(rpart.plot)
Loading required package: rpart
Attaching package: 'rpart'
The following object is masked from 'package:dials':
prune
# 1. Reload the data freshmaster <-read_csv("Master_Feature_Table.csv", show_col_types =FALSE)# 2. Correct data types# Make Malignant the positive class (first level)master <- master %>%mutate(Lesion =factor(Lesion),ID =factor(ID),Scan =factor(Scan),BIRADS =factor(BIRADS),Class =factor(as.character(Class),levels =c("1", "0"),labels =c("Malignant", "Benign")) )# Check codinglevels(master$Class)
[1] "Malignant" "Benign"
table(master$Class, useNA ="ifany")
Malignant Benign
104 96
# 3. Grouped 10-fold cross-validationset.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v =10)# 4. Recipe# Tree models do not require normalizationtree_recipe <-recipe(Class ~ ., data = master) %>%update_role(Lesion, ID, Scan, BIRADS, new_role ="ID") %>%step_zv(all_predictors())# 5. Define decision tree model# Tune cost_complexity, tree_depth, and min_ntree_spec <-decision_tree(cost_complexity =tune(),tree_depth =tune(),min_n =tune()) %>%set_engine("rpart") %>%set_mode("classification")# 6. Workflowtree_wf <-workflow() %>%add_recipe(tree_recipe) %>%add_model(tree_spec)# 7. Metricsclass_metrics <-metric_set(roc_auc, accuracy, sens, spec)# 8. Tuning gridtree_grid <-grid_regular(cost_complexity(),tree_depth(),min_n(range =c(2L, 20L)),levels =4)tree_grid
# 14. Metrics for the best decision treebest_tree_metrics <-collect_metrics(tree_tune_res) %>%inner_join(best_tree, by =c("cost_complexity", "tree_depth", "min_n"))best_tree_metrics
# 15. Finalize workflow with best parametersfinal_tree_wf <-finalize_workflow(tree_wf, best_tree)# 16. Fit finalized tree to full datafinal_tree_fit <-fit(final_tree_wf, data = master)# 17. Plot the final treefinal_tree_obj <-extract_fit_parsnip(final_tree_fit)$fitrpart.plot( final_tree_obj,type =2,extra =104,fallen.leaves =TRUE,box.palette ="Blues",shadow.col ="gray",nn =TRUE,main ="Final Decision Tree for Lesion Classification")
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
Call rpart.plot with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
Very large cost_complexity (0.1) collapses the tree too much, which is why that panel looks almost flat and weaker.
So the tuning plot suggests: A single decision tree has only moderate predictive ability and does not match the performance of the more flexible ensemble methods.
As for the tree: peri-tumoral features are very influential, some intra-tumoral features also matter, both distributional and texture features contribute.
The decision tree identified Omega_Peri as the primary splitting variable, indicating that peri-tumoral intensity-related information plays a major role in lesion classification. Subsequent splits involved U_Peri, Alpha_Intra, Correlation_Peri, Theta_Intra, and Contrast_Peri.
5. Random Forest
# ============================================================# RANDOM FOREST WITH GROUPED 10-FOLD CV# ============================================================library(tidyverse)library(tidymodels)library(ranger)# Reload datamaster <-read_csv("Master_Feature_Table.csv", show_col_types =FALSE)# Correct data typesmaster <- master %>%mutate(Lesion =factor(Lesion),ID =factor(ID),Scan =factor(Scan),BIRADS =factor(BIRADS),Class =factor(as.character(Class),levels =c("1", "0"),labels =c("Malignant", "Benign")) )# Grouped 10-fold CVset.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v =10)# Reciperf_recipe <-recipe(Class ~ ., data = master) %>%update_role(Lesion, ID, Scan, BIRADS, new_role ="ID") %>%step_zv(all_predictors())# Random forest modelrf_spec <-rand_forest(mtry =tune(),min_n =tune(),trees =1000) %>%set_engine("ranger", importance ="impurity") %>%set_mode("classification")# Workflowrf_wf <-workflow() %>%add_recipe(rf_recipe) %>%add_model(rf_spec)# Metricsclass_metrics <-metric_set(roc_auc, accuracy, sens, spec)# Count only actual predictorsp <- master %>%select(-Class, -Lesion, -ID, -Scan, -BIRADS) %>%ncol()p
ROC AUC is highest when mtry is fairly large, around 18 to 24.
The best curves also seem to come from smaller-to-moderate min_n, especially around 6 or 20, depending on the exact metric.
Sensitivity and specificity move in opposite directions a bit, which is normal. Some settings catch more malignant lesions, while others do better at recognizing benign lesions.
Overall, random forest seems to be performing in the low 0.82 range for ROC AUC, which is better than your logistic model and also a little better than your SVM.
The random forest appears to capture nonlinear structure and feature interactions better than logistic regression and SVM, yielding the strongest overall discrimination so far.
Plot variable importance
# Finalize workflow with best parametersfinal_rf_wf <-finalize_workflow(rf_wf, best_rf)# Fit finalized random forest to the full datasetfinal_rf_fit <-fit(final_rf_wf, data = master)# Extract variable importance from ranger fitrf_importance <-extract_fit_parsnip(final_rf_fit)$fit$variable.importance# Convert to tibblerf_importance_tbl <-tibble(Feature =names(rf_importance),Importance =as.numeric(rf_importance)) %>%arrange(desc(Importance))# View importance tablerf_importance_tbl
# Plot top 10 variable importancerf_importance_tbl %>%slice_max(order_by = Importance, n =10) %>%ggplot(aes(x =reorder(Feature, Importance), y = Importance)) +geom_col(fill ="steelblue") +coord_flip() +theme_minimal() +labs(title ="Top 10 Variable Importance Scores",x ="Feature",y ="Importance" )
Peri-tumoral features are especially important, since several top variables come from the peri region.Omega_Peri stands out clearly as the strongest predictor.
Both distribution-based parameters (Omega, Theta, Alpha, U, X) and texture-based measures (Correlation, Homogeneity) matter.This supports your earlier idea that classification improves when the model can combine multiple lesion characteristics instead of relying on one feature at a time.
# A tibble: 4 × 5
Model accuracy roc_auc sens spec
<chr> <dbl> <dbl> <dbl> <dbl>
1 Random Forest 0.72 0.82 0.72 0.75
2 SVM (RBF) 0.69 0.79 0.64 0.75
3 Logistic Regression 0.67 0.73 0.66 0.7
4 Decision Tree 0.67 0.72 0.67 0.67
LOGISTIC + RANDOM FOREST FOR INTRA / PERI / COMBINED
# ==================================================# COMPARISON# LOGISTIC REGRESSION + RANDOM FOREST# INTRA ONLY vs PERI ONLY vs COMBINED# ==================================================library(tidyverse)library(tidymodels)library(ranger)# 1. Reload datamaster <-read_csv("Master_Feature_Table.csv", show_col_types =FALSE)# 2. Correct data typesmaster <- master %>%mutate(Lesion =factor(Lesion),ID =factor(ID),Scan =factor(Scan),BIRADS =factor(BIRADS),Class =factor(as.character(Class),levels =c("1", "0"),labels =c("Malignant", "Benign")) )# 3. Grouped 10-fold CVset.seed(123)lesion_folds <-group_vfold_cv(master, group = Lesion, v =10)# 4. Identify feature groupsintra_vars <-names(master)[str_detect(names(master), "_Intra$")]peri_vars <-names(master)[str_detect(names(master), "_Peri$")]# 5. Metricsclass_metrics <-metric_set(roc_auc, accuracy, sens, spec)# ============================================================# A. LOGISTIC REGRESSION# ============================================================log_spec <-logistic_reg() %>%set_engine("glm") %>%set_mode("classification")# ----------------------------# Intra-only logistic# ----------------------------log_recipe_intra <-recipe(Class ~ ., data = master) %>%update_role(Lesion, ID, Scan, BIRADS, new_role ="ID") %>%step_rm(-all_of(c("Class", "Lesion", "ID", "Scan", "BIRADS", intra_vars))) %>%step_zv(all_predictors()) %>%step_normalize(all_numeric_predictors())log_wf_intra <-workflow() %>%add_recipe(log_recipe_intra) %>%add_model(log_spec)set.seed(123)log_res_intra <-fit_resamples( log_wf_intra,resamples = lesion_folds,metrics = class_metrics,control =control_resamples(save_pred =TRUE))
Warning: Using `all_of()` outside of a selecting function was deprecated in tidyselect
1.2.0.
ℹ See details at
<https://tidyselect.r-lib.org/reference/faq-selection-context.html>
# A tibble: 6 × 6
Model Feature_Set accuracy roc_auc sens spec
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Logistic Intra + Peri 0.67 0.727 0.664 0.7
2 Logistic Peri only 0.675 0.72 0.659 0.696
3 Logistic Intra only 0.645 0.712 0.631 0.707
4 Random Forest Intra + Peri 0.725 0.823 0.723 0.749
5 Random Forest Intra only 0.675 0.817 0.688 0.711
6 Random Forest Peri only 0.66 0.711 0.625 0.715
To directly evaluate the contribution of peri-tumoral information, we compared intra-only, peri-only, and combined intra-plus-peri predictor sets using logistic regression and random forest under grouped 10-fold cross-validation.
For logistic regression, the combined feature set achieved the highest ROC AUC (0.727), followed by peri-only (0.720) and intra-only (0.712).
For random forest, the combined feature set again performed best, with ROC AUC of 0.823, compared with 0.817 for intra-only and 0.711 for peri-only.
These results indicate that peri-tumoral information is most useful when combined with intra-tumoral features, rather than used in isolation. Overall, the strongest classification performance was obtained by the random forest model using the combined intra-plus-peri feature set.