Executive Summary

Introduction

Breast cancer is one of the leading causes of cancer-related mortality among women. Early and accurate diagnosis is critical for effective treatment and improved patient outcomes. This case study explores the application of statistical modeling to predict breast cancer malignancy from Fine Needle Aspiration (FNA) test results, using a dataset of features that describe the cell nuclei found in FNA samples.

Key Findings

  • Model Performance: The study evaluated several predictive models, with the Random Forest model achieving the highest test-set accuracy (96.47%) and an area under the ROC curve (AUC) of 0.9933, indicating excellent predictive capability.
  • Feature Importance: Key features driving model predictions were identified, including ‘radius_worst’, ‘perimeter_worst’, ‘area_worst’, and ‘concave.points_worst’. These four features ranked highest on both Random Forest importance measures (mean decrease in accuracy and mean decrease in Gini).
  • Model Comparison: The Support Vector Machine (SVM) model also performed well, with a cross-validated accuracy of 97.30% and a test-set accuracy of 95.88%, but Random Forest provided the best balance of accuracy and interpretability.

Insights Behind Key Findings

  • Diagnostic Accuracy: The high performance of the Random Forest model highlights the potential of machine learning to improve diagnostic accuracy beyond traditional methods.
  • Clinical Utility: The identification of the key features provides insight into the tumor characteristics most associated with malignancy, which can aid physicians in accurate diagnosis.
  • Algorithm Selection: The comparison of models highlights the importance of selecting an algorithm that not only performs well statistically but also meets the clinical need for interpretability.

Problem

Overview

Traditional diagnostic methods, while effective, can sometimes be invasive, costly, and not accessible to all populations. The application of machine learning in medical diagnostics presents an opportunity to enhance the accuracy, efficiency, and accessibility of breast cancer detection.

Application Perspective

From an application standpoint, the problem centers around the need to improve the diagnostic process for breast cancer using Fine Needle Aspiration (FNA) tests. These tests, which involve extracting cell samples from a breast lump using a fine needle, are less invasive than surgical biopsies and quicker to perform. However, the interpretation of FNA test results can be subjective and reliant on the expertise of the cytologist. By applying machine learning techniques to analyze features derived from digitized images of FNA samples, we aim to develop a predictive model that can assist medical professionals by providing a second, highly accurate opinion.

Theoretical Perspective

Theoretically, the challenge is to apply classification models to predict whether FNA test results indicate benign or malignant breast tumors based on features describing the characteristics of cell nuclei present in the sample images. This involves understanding which features (e.g., texture, shape, and size of cell nuclei) are most predictive of malignancy and developing a model that can generalize well from training data to unseen data while maintaining high sensitivity and specificity. Machine learning models offer the promise of capturing complex patterns in data that might be missed by human eyes, thus potentially outperforming traditional diagnostic methods in terms of accuracy and speed.

Literature Review

https://www.cancer.gov/types/breast

https://stackoverflow.com/questions/46572275/the-train-function-in-r-caret-package

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9175124/

Methods

Data Description

The dataset utilized in this study consists of features extracted from digitized images of Fine Needle Aspiration (FNA) samples of breast masses. Each entry in the dataset corresponds to a single FNA test result, with the following attributes:

  • Patient ID number: Unique identifier for each patient.
  • Diagnosis: Categorical variable indicating whether the tumor is benign (B) or malignant (M).
  • Numerical Features: Ten real-valued features are computed for each cell nucleus: radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter squared divided by area, minus 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension (“coastline approximation” minus 1). Each feature is summarized three ways across the nuclei in an image: the mean, the standard error, and the “worst” value (the mean of the three largest values), resulting in 30 features per sample.

The dataset contains 569 instances, with 357 benign and 212 malignant cases (roughly 63% and 37%). This moderate imbalance allows model training without excessive bias toward one class.
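As a quick check on the structure described above, the following R sketch rebuilds the 30 feature names from the ten base measurements and three summaries, and computes the class proportions from the stated counts. The naming scheme (for example `radius_mean` or `concave.points_worst`) follows the column names that appear in the appendix output; the variable names in the sketch itself are illustrative.

```r
# Ten base measurements per nucleus, each summarised three ways
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave.points",
                   "symmetry", "fractal_dimension")
summaries <- c("mean", "se", "worst")

# Build names such as "radius_mean" or "concave.points_worst"
feature_names <- as.vector(outer(base_features, summaries, paste, sep = "_"))
length(feature_names)  # 30

# Class proportions from the counts stated above
class_counts <- c(B = 357, M = 212)
round(class_counts / sum(class_counts), 3)
```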

Sampling Techniques

The data is derived from a public repository and represents a convenience sample, as is typical in medical research on a specific condition such as breast cancer. The data was randomly split, stratified by diagnosis, into training (70%, 399 samples) and testing (30%, 170 samples) sets so that model performance could be evaluated on data not used for fitting. Testing on unseen data approximates how the model would perform in practice.
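A stratified 70/30 split of this kind can be sketched in base R as follows. This is an illustration on synthetic labels with the class counts reported for the dataset, not the procedure used in the study (which calls `createDataPartition` from caret, whose internals may differ). The seed and variable names are assumptions made for the example.

```r
set.seed(123)  # arbitrary seed, for reproducibility of the sketch only

# Synthetic labels with the class counts reported for this dataset
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))

# Sample 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(diagnosis), diagnosis), function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}))

train_labels <- diagnosis[train_idx]
test_labels  <- diagnosis[-train_idx]

table(train_labels)
table(test_labels)
```

Sampling within each class keeps the benign-to-malignant ratio nearly identical in both partitions, which matters when the test set is small.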

Models and Assumptions

Several machine learning models were evaluated:

  • Logistic Regression: A statistical model that uses the logistic function to model the probability of class membership. It assumes a linear relationship between the logit of the outcome and each predictor variable. The model is highly interpretable but may not capture complex patterns as effectively as other algorithms.
  • Decision Trees: Non-parametric models that split the data into subsets based on feature values, assuming that the data can be segmented into classes using decision rules inferred from the features. Trees can overfit the data but are very interpretable.
  • Random Forests: An ensemble of decision trees that improves prediction stability and accuracy by aggregating the votes of many deep decision trees, each trained on a bootstrap sample of the training set with a random subset of features considered at each split. It assumes that randomly sampling data points and features yields trees that are sufficiently decorrelated for their combined vote to be more reliable than any single tree.
  • SVM: Finds the hyperplane that separates the classes with the maximum margin. A linear SVM assumes the classes are approximately linearly separable in the original feature space; kernel functions relax this assumption by implicitly mapping the data into a higher-dimensional space, allowing the SVM to capture more complex relationships.
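Two of the assumptions above can be written out explicitly. The notation is introduced here for illustration: $p(\mathbf{x})$ is the modeled probability of one class given the 30 features $x_1, \dots, x_{30}$, and $\sigma$ is the kernel width. The radial kernel is given in the parameterization used by the kernlab package, which is assumed to be the one behind the `sigma` value reported in the appendix.

```latex
% Logistic regression: the log-odds are linear in the predictors
\log \frac{p(\mathbf{x})}{1 - p(\mathbf{x})} = \beta_0 + \sum_{j=1}^{30} \beta_j x_j

% Radial basis function kernel used by the SVM
K(\mathbf{x}, \mathbf{x}') = \exp\left( -\sigma \, \lVert \mathbf{x} - \mathbf{x}' \rVert^2 \right)
```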

Limitations

  • Sample Size: While adequate for initial analyses, larger datasets might be needed to fully train and validate models.
  • Overfitting: Complex models like decision trees might overfit the training data, performing well on training data but poorly on testing data.

Data

Data Handling

  • Data Cleaning: Initial checks for missing values and inconsistencies were performed.
  • Identification and Treatment of Outliers: Outlier detection was conducted using statistical methods like the Interquartile Range (IQR) and box plots. Outliers were assessed for their impact on the analysis. In most cases, outliers were retained because they represent valid extremes that are meaningful in medical diagnostics.
  • Data Transformation: Several features exhibited skewed distributions, which were normalized using log transformations to reduce the impact of extreme values and to meet the assumptions of certain statistical tests and models.
  • Feature Scaling: All numeric features were standardized (mean-centered and scaled to have unit variance) to enhance the performance of models sensitive to feature scaling, such as SVM and KNN.
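The IQR rule and the standardization step described above can be sketched on a toy vector. The helper `flag_iqr_outliers` and the multiplier of 1.5 are assumptions for illustration; the study does not state which multiplier or implementation was used.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Toy values: one clear extreme among otherwise similar measurements
x <- c(10, 11, 12, 12, 13, 14, 15, 40)
flag_iqr_outliers(x)

# Standardise to zero mean and unit variance, as in the scaling step
z <- (x - mean(x)) / sd(x)
round(c(mean = mean(z), sd = sd(z)), 10)
```

Flagging rather than deleting is consistent with the decision above to retain most outliers as valid extremes.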

Data Exploration

Before diving into predictive modeling, exploratory data analysis (EDA) was conducted:

  • Distribution Analysis: Histograms and density plots were created for all continuous variables to understand their underlying distributions and to identify any deviations from expected patterns.
  • Correlation Analysis: A correlation matrix was utilized to identify multicollinearity among features. High correlations between features like radius_mean, perimeter_mean, and area_mean were observed, which was expected.
  • Feature Relationships: Scatter plots and pair plots were used to explore relationships between features and to visualize how features differ by diagnosis category. This helped in understanding which features are most discriminative in separating the two classes.
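The correlation screen described above can be illustrated on synthetic data in which perimeter and area are derived from radius, mimicking why those three features move together. The data, the 0.9 threshold, and all variable names are assumptions for the sketch and are not taken from the study.

```r
set.seed(1)  # arbitrary seed for the synthetic data

# Synthetic nuclei: perimeter and area derived from radius as for a circle,
# plus an unrelated texture-like variable
radius    <- runif(200, min = 6, max = 28)
perimeter <- 2 * pi * radius
area      <- pi * radius^2
texture   <- rnorm(200, mean = 19, sd = 4)
toy <- data.frame(radius, perimeter, area, texture)

# Pairwise Pearson correlations
cor_mat <- cor(toy)

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold <- 0.9
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(feature_1 = rownames(cor_mat)[high[, "row"]],
           feature_2 = colnames(cor_mat)[high[, "col"]],
           r = round(cor_mat[high], 3))
```

Restricting the search to the upper triangle reports each pair once and skips the diagonal.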

Preliminary Data Insights

  • Key Distributions: The features related to the geometry of the cell nuclei (area, perimeter, radius) showed distinct distributions when segmented by diagnosis, indicating their potential utility in predicting cancer malignancy.
  • Feature Correlations: Several features were highly correlated with each other, suggesting redundancy. For instance, the features derived from the radius, perimeter, and area were closely linked, which informed decisions during feature selection to avoid multicollinearity in models like logistic regression.
  • Class Imbalance: Although the dataset was mostly balanced, the slight imbalance was addressed during model training by using stratified sampling to ensure that both classes were adequately represented in training and testing sets.

Results

Model Performance Overview

The predictive modeling yielded several key results, with models evaluated based on their accuracy, sensitivity, specificity, area under the curve (AUC), and kappa statistics. Below is a summary of the performance of major models tested:

  • Random Forest: Demonstrated the highest overall test-set performance, with an accuracy of 96.47%, sensitivity of 97.20%, specificity of 95.24%, and an AUC of 0.9933. Its kappa of 0.9243 indicates excellent agreement beyond chance. These figures treat benign (B) as the positive class, so the specificity is the share of malignant cases correctly identified (60 of 63).
  • SVM: Reached the highest cross-validated accuracy (97.30%, radial kernel with C = 1) and a test-set accuracy of 95.88%, with sensitivity of 96.26%, specificity of 95.24%, and a kappa of 0.912.
  • Logistic Regression: Offered good interpretability, though slightly lower performance metrics compared to ensemble methods and SVM, which was anticipated due to its linear nature.
  • Decision Trees: Provided clear insights into the decision-making process, with a cross-validated accuracy of 93.61%. Single trees are prone to overfitting, a weakness that the Random Forest ensemble mitigates.
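The Random Forest figures above follow directly from its test-set confusion matrix, as this short R sketch shows. The counts are those reported in the appendix; the variable names are illustrative.

```r
# Test-set confusion matrix reported for the Random Forest (rows = predicted)
cm <- matrix(c(104, 3,
                 3, 60),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # benign is the positive class
specificity <- cm["M", "M"] / sum(cm[, "M"])  # share of malignant cases caught

# Cohen's kappa: observed agreement corrected for chance agreement
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa), 4)
```

Because benign is the positive class, the three malignant cases predicted as benign show up in the specificity, not the sensitivity.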

Analysis of Covariate Relationships

The relationships between covariates and the response variable (diagnosis) were explored through the models:

  • Size- and Shape-related Features: Features like radius_worst, perimeter_worst, area_worst, and concave.points_worst were highly predictive of malignancy. This aligns with the clinical understanding that malignant samples tend to show larger and more irregularly shaped nuclei than benign ones.
  • Texture and Compactness: These features also played significant roles, especially in SVM and Random Forest models, indicating that the internal consistency and variability of cell structures in the tumor tissues are indicative of malignant transformations.

Insights Behind Relationships

  • Tumor Size and Shape: Malignant tumors often exhibit rapid growth rates and irregular borders, both of which are captured by perimeter and area-related features. These characteristics are considered key indicators of cancer severity.
  • Cellular Features: Features like compactness, concavity, and texture relate to how densely packed and irregular the cells are within the tumors. High values in these features generally suggest a more aggressive cancer, which the models captured and used for predictions.

Implications for Clinical Application

  • Enhanced Screening Protocols: Incorporating machine learning predictions could help cytologists prioritize cases for review, focusing on those with a higher likelihood of malignancy.
  • Personalized Treatment: By understanding the specific features that drive a diagnosis of malignancy, clinicians can better tailor treatment plans according to the predicted tumor behavior.
  • Early Detection: Models like Random Forest, with high sensitivity, could be crucial in screening programs, potentially catching more cases at an early stage where treatment is more likely to be successful.

Limitations and Future Work

While the models performed well, larger and more diverse datasets are needed to establish how well the findings generalize. Deploying these models in practice will also require validation studies in clinical settings to understand their impact and usability.

Conclusions

Summary of Aim and Results

The primary aim of this study was to explore the application of machine learning techniques to improve the accuracy of breast cancer diagnosis using FNA test results. By analyzing a dataset composed of features derived from images of breast mass samples, several predictive models were developed and evaluated.

The key findings from the study include:

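  • The Random Forest model achieved the highest test-set accuracy (96.47%) with an AUC of 0.9933, and the SVM reached a test-set accuracy of 95.88%.
  • Important features identified across models included tumor size metrics (e.g., radius, perimeter, area) and texture, which are critical indicators of malignancy in clinical practice.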

These results validate the potential of machine learning models to act as effective diagnostic aids, providing support to clinicians in making more accurate and timely cancer diagnoses.

Discussion of Alternative Methods

While the methods employed in this study yielded promising results, alternative approaches could be considered to further enhance model performance or address specific challenges:

  • Ensemble Learning: Beyond Random Forest, other ensemble methods like Gradient Boosting Machines (GBM) or Stacking could be explored. These methods often provide improved predictive performance by combining the strengths of multiple learning algorithms.
  • Feature Engineering: Additional features could be engineered from the existing data, or new data types like genetic markers or patient demographics could be included to provide models with a richer context for making predictions.
  • Regularization Techniques: To handle potential overfitting and enhance the generalizability of models like logistic regression or neural networks, regularization techniques (e.g., L1, L2 regularization) could be applied.
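For logistic regression, the regularization mentioned above amounts to adding a penalty on the coefficients to the log-loss. The formulation below is the standard one and uses notation introduced here for illustration: $y_i \in \{0, 1\}$ is the class label, $p_i$ the fitted probability for sample $i$, and $\lambda \ge 0$ a tuning parameter that would need to be chosen, for example by cross-validation.

```latex
\hat{\boldsymbol{\beta}} = \arg\min_{\beta_0, \boldsymbol{\beta}}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n} \bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr]
  + \lambda \, P(\boldsymbol{\beta}) \right\},
\qquad
P(\boldsymbol{\beta}) =
\begin{cases}
  \lVert \boldsymbol{\beta} \rVert_1 = \sum_j \lvert \beta_j \rvert & \text{(L1)} \\
  \lVert \boldsymbol{\beta} \rVert_2^2 = \sum_j \beta_j^2 & \text{(L2)}
\end{cases}
```

The L1 penalty can shrink some coefficients exactly to zero, which may help with the highly correlated radius, perimeter, and area features noted earlier.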

Future Directions

Continued research should focus on integrating machine learning models into clinical workflows, including conducting prospective studies to validate these models in a clinical setting. Additionally, efforts should be made to make these models interpretable to clinicians to ensure they complement existing diagnostic processes without replacing the critical oversight provided by experienced medical professionals.


Preprocessing & EDA

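Checking for NA values and converting diagnosis to a factor

# Packages used throughout this appendix
library(ggplot2)      # ggplot(), geom_boxplot(), geom_histogram()
library(gridExtra)    # grid.arrange()
library(caret)        # preProcess(), createDataPartition(), train(), confusionMatrix()
library(randomForest) # randomForest(), importance()
library(pROC)         # roc(), auc()

# cancer_data is assumed to be already loaded as a data frame
sum(is.na(cancer_data))
[1] 0
cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)
feature_list <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean")

# Build one boxplot per feature. lapply() gives each plot its own copy of
# `feature`; in a for loop, aes() would look the name up only when the plot is
# drawn, so every panel would show the last feature.
plots_list <- lapply(feature_list, function(feature) {
  ggplot(cancer_data, aes(x = diagnosis, y = .data[[feature]], fill = diagnosis)) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
    ggtitle(feature) +
    xlab("Diagnosis") +
    ylab(feature)
})
do.call(grid.arrange, c(plots_list, ncol = 2))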

ggplot(cancer_data, aes(x = radius_mean)) + 
    geom_histogram(bins = 30, alpha = 0.7) +
    ggtitle("Distribution of Mean Radius") +
    xlab("Mean Radius") + 
    ylab("Frequency")

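set.seed(123)
# Drop the first column (patient ID), which carries no predictive information
cancer_data <- cancer_data[, -1]
# Diagnosis is now the first column; center and scale the 30 numeric features.
# Note: the scaling parameters are estimated on the full dataset, before the split
preproc <- preProcess(cancer_data[, -1], method = c("center", "scale"))
cancer_data_norm <- predict(preproc, cancer_data[, -1])

# Re-attach the diagnosis label as the first column
cancer_data_norm <- cbind(diagnosis = cancer_data$diagnosis, cancer_data_norm)

# Stratified 70/30 split on diagnosis
trainIndex <- createDataPartition(cancer_data_norm$diagnosis, p = 0.7, list = FALSE)
train_data <- cancer_data_norm[trainIndex, ]
test_data <- cancer_data_norm[-trainIndex, ]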

Model Comparison

control <- trainControl(method = "repeatedcv", number = 10, repeats = 4)

# set.seed(123)
# model_log <- train(diagnosis ~ ., data = train_data, method = "glm", 
#                    family = "binomial", trControl = control)

set.seed(123)
model_tree <- train(diagnosis ~ ., data = train_data, method = "rpart", 
                    trControl = control)

set.seed(123)
model_rf <- train(diagnosis ~ ., data = train_data, method = "rf", 
                  trControl = control, ntree = 100)

set.seed(123)
model_svm <- train(diagnosis ~ ., data = train_data, method = "svmRadial", 
                   trControl = control, preProcess = c("center", "scale"))

Tree Model

model_tree
CART 

399 samples
 30 predictor
  2 classes: 'B', 'M' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ... 
Resampling results across tuning parameters:

  cp         Accuracy   Kappa    
  0.0000000  0.9360577  0.8642194
  0.1006711  0.9098077  0.8029900
  0.8053691  0.8120673  0.5381601

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.

Random Forest Model

model_rf
Random Forest 

399 samples
 30 predictor
  2 classes: 'B', 'M' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9517147  0.8965686
  16    0.9548558  0.9036565
  30    0.9517147  0.8967690

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 16.

SVM Model

model_svm
Support Vector Machines with Radial Basis Function Kernel 

399 samples
 30 predictor
  2 classes: 'B', 'M' 

Pre-processing: centered (30), scaled (30) 
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ... 
Resampling results across tuning parameters:

  C     Accuracy   Kappa    
  0.25  0.9604647  0.9146251
  0.50  0.9711218  0.9378662
  1.00  0.9729968  0.9419621

Tuning parameter 'sigma' was held constant at a value of 0.03945471
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.03945471 and C = 1.

RF - Testing

predictions_rf <- predict(model_rf, test_data)
confusionMatrix(predictions_rf, test_data$diagnosis)
Confusion Matrix and Statistics

          Reference
Prediction   B   M
         B 102   2
         M   5  61
                                         
               Accuracy : 0.9588         
                 95% CI : (0.917, 0.9833)
    No Information Rate : 0.6294         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.9126         
                                         
 Mcnemar's Test P-Value : 0.4497         
                                         
            Sensitivity : 0.9533         
            Specificity : 0.9683         
         Pos Pred Value : 0.9808         
         Neg Pred Value : 0.9242         
             Prevalence : 0.6294         
         Detection Rate : 0.6000         
   Detection Prevalence : 0.6118         
      Balanced Accuracy : 0.9608         
                                         
       'Positive' Class : B              
                                         

SVM - Testing

predictions_svm <- predict(model_svm, test_data)
confusionMatrix(predictions_svm, test_data$diagnosis)
Confusion Matrix and Statistics

          Reference
Prediction   B   M
         B 103   3
         M   4  60
                                         
               Accuracy : 0.9588         
                 95% CI : (0.917, 0.9833)
    No Information Rate : 0.6294         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.912          
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9626         
            Specificity : 0.9524         
         Pos Pred Value : 0.9717         
         Neg Pred Value : 0.9375         
             Prevalence : 0.6294         
         Detection Rate : 0.6059         
   Detection Prevalence : 0.6235         
      Balanced Accuracy : 0.9575         
                                         
       'Positive' Class : B              
                                         

Random Forest Analysis

set.seed(123) 
rf_model <- randomForest(diagnosis ~ ., data = train_data, ntree = 500, mtry = 2, importance = TRUE)

rf_predictions <- predict(rf_model, newdata = test_data)
rf_confusion_matrix <- confusionMatrix(rf_predictions, test_data$diagnosis)
print(rf_confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction   B   M
         B 104   3
         M   3  60
                                          
               Accuracy : 0.9647          
                 95% CI : (0.9248, 0.9869)
    No Information Rate : 0.6294          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9243          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9720          
            Specificity : 0.9524          
         Pos Pred Value : 0.9720          
         Neg Pred Value : 0.9524          
             Prevalence : 0.6294          
         Detection Rate : 0.6118          
   Detection Prevalence : 0.6294          
      Balanced Accuracy : 0.9622          
                                          
       'Positive' Class : B               
                                          
# Predicted probability of the malignant class, selected by name rather than position
rf_pred_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "M"]
roc_obj <- roc(test_data$diagnosis, rf_pred_prob)
plot(roc_obj, main = "ROC Curve for Random Forest")

auc(roc_obj)
Area under the curve: 0.9933
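To make the AUC value above concrete: it can be read as the probability that a randomly chosen case from one class receives a higher score than a randomly chosen case from the other. The sketch below computes it from ranks on a toy example. The helper `auc_by_rank` is hypothetical and is not how pROC computes the value internally.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# case receives a higher score than a randomly chosen negative case
auc_by_rank <- function(scores, is_positive) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: four cases, two of them positive
scores      <- c(0.10, 0.40, 0.35, 0.80)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc_by_rank(scores, is_positive)  # 0.75
```

In the toy example, three of the four positive-negative pairs are ordered correctly, giving 0.75.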
rf_importance <- importance(rf_model)
print(rf_importance)
                                B         M MeanDecreaseAccuracy MeanDecreaseGini
radius_mean             10.236062  7.731982            11.732925        11.842355
texture_mean             6.448378  8.245617             9.433029         3.539811
perimeter_mean           8.988261  7.941451            10.624452        13.299253
area_mean               10.810357  8.562741            12.393963        12.120516
smoothness_mean          4.770865  6.443158             8.032897         2.064819
compactness_mean         5.046225  6.410386             7.627476         5.396609
concavity_mean           8.269771  8.342716            11.463128        11.201664
concave.points_mean      7.895845  9.450777            11.473158        10.578365
symmetry_mean            2.423679  5.474341             5.079133         1.563842
fractal_dimension_mean   5.115642  1.284681             5.192985         1.375167
radius_se                5.914600  6.060879             8.033135         5.337706
texture_se               3.256261  1.342091             3.594407         1.019867
perimeter_se             5.070851  4.572926             6.809057         4.501245
area_se                  6.081854  6.900908             8.489910         7.470514
smoothness_se            2.490906  2.344951             3.445838         1.117554
compactness_se           4.269071  2.265431             4.365899         1.860387
concavity_se             3.925541  4.837293             6.439814         3.097685
concave.points_se        3.208878  3.014068             4.363392         2.038764
symmetry_se              2.864749  1.618175             3.277276         1.206888
fractal_dimension_se     2.893913  1.847954             3.666368         1.294903
radius_worst            11.564202 10.189436            14.075722        14.518534
texture_worst            7.838943  9.538604            11.146527         3.755351
perimeter_worst         10.795555 10.947340            14.255299        13.941580
area_worst              12.255440 10.624462            14.787458        14.342749
smoothness_worst         5.466823  7.005480             8.449408         2.820616
compactness_worst        6.924880  6.565461             9.333456         5.608722
concavity_worst          8.959237  8.706496            11.944962         8.695803
concave.points_worst    11.958970  9.979048            14.876459        14.872499
symmetry_worst           7.396155  7.016404            10.296186         3.369300
fractal_dimension_worst  5.533096  4.965504             7.447399         2.364776