Breast cancer is one of the leading causes of cancer-related mortality among women. Early and accurate diagnosis is critical for effective treatment and improved patient outcomes. This case study explores the application of statistical modeling to predict breast cancer malignancy from Fine Needle Aspiration (FNA) test results. It focuses on a dataset of features describing the characteristics of cell nuclei found in the FNA samples.
Traditional diagnostic methods, while effective, can sometimes be invasive, costly, and not accessible to all populations. The application of machine learning in medical diagnostics presents an opportunity to enhance the accuracy, efficiency, and accessibility of breast cancer detection.
From an application standpoint, the problem centers around the need to improve the diagnostic process for breast cancer using Fine Needle Aspiration (FNA) tests. These tests, which involve extracting cell samples from a breast lump using a fine needle, are less invasive than surgical biopsies and quicker to perform. However, the interpretation of FNA test results can be subjective and reliant on the expertise of the cytologist. By applying machine learning techniques to analyze features derived from digitized images of FNA samples, we aim to develop a predictive model that can assist medical professionals by providing a second, highly accurate opinion.
Theoretically, the challenge is to apply classification models to predict whether FNA test results indicate benign or malignant breast tumors based on features describing the characteristics of cell nuclei present in the sample images. This involves understanding which features (e.g., texture, shape, and size of cell nuclei) are most predictive of malignancy and developing a model that can generalize well from training data to unseen data while maintaining high sensitivity and specificity. Machine learning models offer the promise of capturing complex patterns in data that might be missed by human eyes, thus potentially outperforming traditional diagnostic methods in terms of accuracy and speed.
https://www.cancer.gov/types/breast
https://stackoverflow.com/questions/46572275/the-train-function-in-r-caret-package
The dataset utilized in this study consists of features extracted from digitized images of Fine Needle Aspiration (FNA) samples of breast masses. Each entry in the dataset corresponds to a single FNA test result, with the following attributes:
The dataset contains 569 instances, with 357 benign and 212 malignant cases, reflecting a moderately balanced scenario that facilitates model training without excessive bias toward one class.
The data is derived from a public repository and represents a convenience sample, typical in medical research where specific conditions, such as breast cancer, are studied. The data was randomly split into training (70%) and testing (30%) sets to evaluate the performance of the models in an unbiased manner. This split ensures that the model can be tested on unseen data, simulating how the model would perform in practical scenarios.
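The 70/30 split can be illustrated with a minimal base-R sketch. It assumes a stratified split that samples 70% of each class separately and rounds the per-class count up; the label vector below is synthetic and only reproduces the class counts stated above (357 benign, 212 malignant). Under those assumptions the sizes agree with the 399 training samples and the 107 benign / 63 malignant test cases that appear in the model output later in this report.

```r
# Synthetic labels with the class counts reported for the dataset
set.seed(123)
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))
p <- 0.7  # training fraction

# Sample 70% of the row indices within each class (rounded up)
train_idx <- unlist(
  lapply(split(seq_along(diagnosis), diagnosis), function(idx) {
    sample(idx, size = ceiling(p * length(idx)))
  }),
  use.names = FALSE
)

train_labels <- diagnosis[train_idx]
test_labels  <- diagnosis[-train_idx]

table(train_labels)  # class counts in the training set
table(test_labels)   # class counts in the test set
```

Sampling within each class keeps the benign-to-malignant ratio nearly identical in both subsets, which matters when the classes are not perfectly balanced.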
Several machine learning models were evaluated:
Before diving into predictive modeling, exploratory data analysis (EDA) was conducted:
Strong correlations among radius_mean, perimeter_mean, and area_mean were observed, which was expected because all three measure the size of the nucleus.
Several features (e.g., area, perimeter, radius) showed distinct distributions when segmented by diagnosis, indicating their potential utility in predicting cancer malignancy.
The predictive modeling yielded several key results, with models evaluated on accuracy, sensitivity, specificity, area under the curve (AUC), and the kappa statistic. Below is a summary of the performance of major models tested:
The relationships between covariates and the response variable (diagnosis) were explored through the models:
The features radius_worst, perimeter_worst, area_worst, and concave.points_worst were highly predictive of malignancy. This aligns with the clinical understanding that malignant tumors tend to be larger and more irregular in shape than benign tumors.
While the models performed well, the findings are limited by the size of the dataset; larger and more diverse datasets are needed to improve generalization. In addition, deploying these models in practice will require validation studies in clinical settings to fully understand their impact and usability.
The primary aim of this study was to explore the application of machine learning techniques to improve the accuracy of breast cancer diagnosis using FNA test results. By analyzing a dataset composed of features derived from images of breast mass samples, several predictive models were developed and evaluated.
The key findings from the study include:
These results validate the potential of machine learning models to act as effective diagnostic aids, providing support to clinicians in making more accurate and timely cancer diagnoses.
While the methods employed in this study yielded promising results, alternative approaches could be considered to further enhance model performance or address specific challenges:
Continued research should focus on integrating machine learning models into clinical workflows, including conducting prospective studies to validate these models in a clinical setting. Additionally, efforts should be made to make these models interpretable to clinicians to ensure they complement existing diagnostic processes without replacing the critical oversight provided by experienced medical professionals.
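# Packages used by the code below
library(caret)         # preProcess, createDataPartition, train, confusionMatrix
library(ggplot2)       # ggplot, geom_boxplot, geom_histogram
library(gridExtra)     # grid.arrange
library(randomForest)  # randomForest, importance
library(pROC)          # roc, auc
# Count missing values, then convert diagnosis to a factor
sum(is.na(cancer_data))
[1] 0
cancer_data$diagnosis <- as.factor(cancer_data$diagnosis)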
# Boxplots of selected features by diagnosis
feature_list <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean")
# Build each plot inside a function so that every plot keeps its own value of
# `feature`; aes() is evaluated lazily, so a plain for loop would draw the
# last feature in every panel.
plots_list <- lapply(feature_list, function(feature) {
  ggplot(cancer_data, aes(x = diagnosis, y = .data[[feature]], fill = diagnosis)) +
    geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 2) +
    ggtitle(feature) +
    xlab("Diagnosis") +
    ylab(feature)
})
# Arrange the six boxplots in a two-column grid
do.call(grid.arrange, c(plots_list, ncol = 2))
# Histogram of mean radius across all samples
ggplot(cancer_data, aes(x = radius_mean)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  ggtitle("Distribution of Mean Radius") +
  xlab("Mean Radius") +
  ylab("Frequency")
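set.seed(123)
# Drop the first column (assumed to be the sample ID) so that diagnosis becomes column 1
cancer_data <- cancer_data[, -1]
# Center and scale the predictors (every column except diagnosis).
# Note: the scaling parameters are estimated on the full dataset, before the split.
preproc <- preProcess(cancer_data[, -1], method = c("center", "scale"))
cancer_data_norm <- predict(preproc, cancer_data[, -1])
cancer_data_norm <- cbind(diagnosis = cancer_data$diagnosis, cancer_data_norm)
# Stratified 70/30 split into training and test sets
trainIndex <- createDataPartition(cancer_data_norm$diagnosis, p = 0.7, list = FALSE)
train_data <- cancer_data_norm[trainIndex, ]
test_data <- cancer_data_norm[-trainIndex, ]
# 10-fold cross-validation, repeated 4 times, for model tuning
control <- trainControl(method = "repeatedcv", number = 10, repeats = 4)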
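The scaling step above estimates its centering and scaling parameters from all rows, including those that later form the test set. A stricter variant estimates them from the training rows only and reuses them for the test rows. The sketch below shows that idea in base R on a small synthetic matrix; the data, the column names `f1` and `f2`, and the 70-row training subset are illustrative assumptions, not the study's data.

```r
# Synthetic predictors: 100 rows, 2 columns (illustrative only)
set.seed(123)
x <- matrix(rnorm(200, mean = 50, sd = 10), ncol = 2,
            dimnames = list(NULL, c("f1", "f2")))

# Hold out 30 rows as a stand-in for the test set
train_rows <- sample(nrow(x), 70)
x_train <- x[train_rows, ]
x_test  <- x[-train_rows, ]

# Estimate centering and scaling parameters from the training rows only
mu    <- colMeans(x_train)
sigma <- apply(x_train, 2, sd)

# Apply the same training-set parameters to both subsets
x_train_std <- scale(x_train, center = mu, scale = sigma)
x_test_std  <- scale(x_test,  center = mu, scale = sigma)

colMeans(x_train_std)      # approximately 0 by construction
apply(x_train_std, 2, sd)  # approximately 1 by construction
colMeans(x_test_std)       # close to 0, but not exactly
```

With caret, the analogous pattern would be to call preProcess() on the training predictors and then predict() on both the training and test predictors.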
# Logistic regression (left commented out)
# set.seed(123)
# model_log <- train(diagnosis ~ ., data = train_data, method = "glm",
#                    family = "binomial", trControl = control)
# Decision tree (CART)
set.seed(123)
model_tree <- train(diagnosis ~ ., data = train_data, method = "rpart",
                    trControl = control)
# Random forest with 100 trees; mtry is tuned by caret
set.seed(123)
model_rf <- train(diagnosis ~ ., data = train_data, method = "rf",
                  trControl = control, ntree = 100)
# Support vector machine with a radial basis function kernel
set.seed(123)
model_svm <- train(diagnosis ~ ., data = train_data, method = "svmRadial",
                   trControl = control, preProcess = c("center", "scale"))
model_tree
CART
399 samples
30 predictor
2 classes: 'B', 'M'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.0000000 0.9360577 0.8642194
0.1006711 0.9098077 0.8029900
0.8053691 0.8120673 0.5381601
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.
model_rf
Random Forest
399 samples
30 predictor
2 classes: 'B', 'M'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9517147 0.8965686
16 0.9548558 0.9036565
30 0.9517147 0.8967690
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 16.
model_svm
Support Vector Machines with Radial Basis Function Kernel
399 samples
30 predictor
2 classes: 'B', 'M'
Pre-processing: centered (30), scaled (30)
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 360, 359, 359, 359, 359, 359, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9604647 0.9146251
0.50 0.9711218 0.9378662
1.00 0.9729968 0.9419621
Tuning parameter 'sigma' was held constant at a value of 0.03945471
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.03945471 and C = 1.
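The resampling line in the outputs above ("10 fold, repeated 4 times", with sample sizes of 360 and 359) can be reproduced in spirit with a short base-R sketch of repeated k-fold cross-validation. The helper `make_folds` is hypothetical, and the folds here are plain random folds; caret additionally stratifies folds by class, so its exact fold sizes can differ slightly.

```r
set.seed(123)
n       <- 399  # training-set size reported in the model output
k       <- 10   # folds per repeat
repeats <- 4    # number of repeats

# Hypothetical helper: assign each row to one of k folds at random and
# return the held-out row indices for every fold
make_folds <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)
}

# One list of k held-out index sets per repeat, flattened to k * repeats sets
resamples <- unlist(
  lapply(seq_len(repeats), function(r) make_folds(n, k)),
  recursive = FALSE
)

length(resamples)                        # number of resamples (k * repeats)
train_sizes <- n - lengths(resamples)    # rows used for fitting in each resample
table(train_sizes)
```

Each candidate tuning value is fitted once per resample, and the Accuracy and Kappa columns in the output are averages over those fits.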
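# Evaluate the caret random forest on the held-out test set.
# confusionMatrix() treats the first factor level (B, benign) as the positive
# class by default, so the Sensitivity reported below is the detection rate for
# benign cases; passing positive = "M" would report it for malignant cases.
predictions_rf <- predict(model_rf, test_data)
confusionMatrix(predictions_rf, test_data$diagnosis)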
Confusion Matrix and Statistics
Reference
Prediction B M
B 102 2
M 5 61
Accuracy : 0.9588
95% CI : (0.917, 0.9833)
No Information Rate : 0.6294
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9126
Mcnemar's Test P-Value : 0.4497
Sensitivity : 0.9533
Specificity : 0.9683
Pos Pred Value : 0.9808
Neg Pred Value : 0.9242
Prevalence : 0.6294
Detection Rate : 0.6000
Detection Prevalence : 0.6118
Balanced Accuracy : 0.9608
'Positive' Class : B
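Because the output above treats benign (B) as the positive class, it can help to recompute the headline metrics with malignant (M) as the class of interest. The sketch below does this in base R from the four counts in the confusion matrix above; the variable names are illustrative, and the kappa formula is the standard Cohen's kappa.

```r
# Counts copied from the random forest confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(102, 5, 2, 61), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# With malignant as the positive class:
sensitivity_M <- cm["M", "M"] / sum(cm[, "M"])  # malignant cases detected
specificity_M <- cm["B", "B"] / sum(cm[, "B"])  # benign cases correctly cleared

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity_M = sensitivity_M,
        specificity_M = specificity_M, kappa = cohen_kappa), 4)
```

Accuracy and kappa do not depend on which class is called positive and match the values printed above; sensitivity and specificity simply swap, so the model detects 61 of the 63 malignant test cases.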
predictions_svm <- predict(model_svm, test_data)
confusionMatrix(predictions_svm, test_data$diagnosis)
Confusion Matrix and Statistics
Reference
Prediction B M
B 103 3
M 4 60
Accuracy : 0.9588
95% CI : (0.917, 0.9833)
No Information Rate : 0.6294
P-Value [Acc > NIR] : <2e-16
Kappa : 0.912
Mcnemar's Test P-Value : 1
Sensitivity : 0.9626
Specificity : 0.9524
Pos Pred Value : 0.9717
Neg Pred Value : 0.9375
Prevalence : 0.6294
Detection Rate : 0.6059
Detection Prevalence : 0.6235
Balanced Accuracy : 0.9575
'Positive' Class : B
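# Refit a random forest directly with randomForest (500 trees, mtry = 2),
# keeping the variable importance measures
set.seed(123)
rf_model <- randomForest(diagnosis ~ ., data = train_data, ntree = 500, mtry = 2, importance = TRUE)
# Evaluate on the held-out test set
rf_predictions <- predict(rf_model, newdata = test_data)
rf_confusion_matrix <- confusionMatrix(rf_predictions, test_data$diagnosis)
print(rf_confusion_matrix)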
Confusion Matrix and Statistics
Reference
Prediction B M
B 104 3
M 3 60
Accuracy : 0.9647
95% CI : (0.9248, 0.9869)
No Information Rate : 0.6294
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9243
Mcnemar's Test P-Value : 1
Sensitivity : 0.9720
Specificity : 0.9524
Pos Pred Value : 0.9720
Neg Pred Value : 0.9524
Prevalence : 0.6294
Detection Rate : 0.6118
Detection Prevalence : 0.6294
Balanced Accuracy : 0.9622
'Positive' Class : B
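# Predicted probability of the second class (M, malignant) for each test case
rf_pred_prob <- predict(rf_model, newdata = test_data, type = "prob")[, 2]
# ROC curve and AUC for the random forest
roc_obj <- roc(test_data$diagnosis, rf_pred_prob)
plot(roc_obj, main = "ROC Curve for Random Forest")
auc(roc_obj)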
Area under the curve: 0.9933
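The AUC can be read as the probability that a randomly chosen malignant case receives a higher predicted probability than a randomly chosen benign case. The base-R sketch below computes it from ranks on a four-case toy example; the function name `auc_by_rank` and the toy scores are illustrative assumptions and are not taken from the study data.

```r
# AUC as the probability that a randomly chosen positive case scores higher
# than a randomly chosen negative case (ties count as one half)
auc_by_rank <- function(labels, scores, positive) {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores share their rank
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: two benign and two malignant cases
toy_labels <- factor(c("B", "B", "M", "M"))
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

auc_by_rank(toy_labels, toy_scores, positive = "M")  # 0.75
```

In the toy example, three of the four malignant-benign pairs are ordered correctly, which gives 0.75.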
rf_importance <- importance(rf_model)
print(rf_importance)
                                B          M  MeanDecreaseAccuracy  MeanDecreaseGini
radius_mean             10.236062   7.731982             11.732925         11.842355
texture_mean             6.448378   8.245617              9.433029          3.539811
perimeter_mean           8.988261   7.941451             10.624452         13.299253
area_mean               10.810357   8.562741             12.393963         12.120516
smoothness_mean          4.770865   6.443158              8.032897          2.064819
compactness_mean         5.046225   6.410386              7.627476          5.396609
concavity_mean           8.269771   8.342716             11.463128         11.201664
concave.points_mean      7.895845   9.450777             11.473158         10.578365
symmetry_mean            2.423679   5.474341              5.079133          1.563842
fractal_dimension_mean   5.115642   1.284681              5.192985          1.375167
radius_se                5.914600   6.060879              8.033135          5.337706
texture_se               3.256261   1.342091              3.594407          1.019867
perimeter_se             5.070851   4.572926              6.809057          4.501245
area_se                  6.081854   6.900908              8.489910          7.470514
smoothness_se            2.490906   2.344951              3.445838          1.117554
compactness_se           4.269071   2.265431              4.365899          1.860387
concavity_se             3.925541   4.837293              6.439814          3.097685
concave.points_se        3.208878   3.014068              4.363392          2.038764
symmetry_se              2.864749   1.618175              3.277276          1.206888
fractal_dimension_se     2.893913   1.847954              3.666368          1.294903
radius_worst            11.564202  10.189436             14.075722         14.518534
texture_worst            7.838943   9.538604             11.146527          3.755351
perimeter_worst         10.795555  10.947340             14.255299         13.941580
area_worst              12.255440  10.624462             14.787458         14.342749
smoothness_worst         5.466823   7.005480              8.449408          2.820616
compactness_worst        6.924880   6.565461              9.333456          5.608722
concavity_worst          8.959237   8.706496             11.944962          8.695803
concave.points_worst    11.958970   9.979048             14.876459         14.872499
symmetry_worst           7.396155   7.016404             10.296186          3.369300
fractal_dimension_worst  5.533096   4.965504              7.447399          2.364776
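To make the ranking explicit, the sketch below sorts features by MeanDecreaseAccuracy. To stay self-contained it copies only the ten features whose MeanDecreaseAccuracy exceeds 11 in the table above into a named vector (the name `mda` is illustrative); every other feature in the table scores lower, so the top of the ranking is unaffected.

```r
# MeanDecreaseAccuracy values copied from the importance table above
# (only the features scoring above 11)
mda <- c(
  radius_mean          = 11.732925,
  area_mean            = 12.393963,
  concavity_mean       = 11.463128,
  concave.points_mean  = 11.473158,
  radius_worst         = 14.075722,
  texture_worst        = 11.146527,
  perimeter_worst      = 14.255299,
  area_worst           = 14.787458,
  concavity_worst      = 11.944962,
  concave.points_worst = 14.876459
)

# Rank from most to least important and show the top four
ranked <- sort(mda, decreasing = TRUE)
head(ranked, 4)
```

The top four are concave.points_worst, area_worst, perimeter_worst, and radius_worst. On the full importance matrix, the same ordering could be obtained with `rf_importance[order(-rf_importance[, "MeanDecreaseAccuracy"]), ]`.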