library(dplyr)
library(ggplot2)
library(tidyverse)
library(pROC)
setwd("~/Desktop/datasets")
mam_mass <- read.csv("mammographic_masses.data", header = FALSE)
#since mammographic_masses.data is a headerless .data file, we read it as CSV and assign the correct column names
colnames(mam_mass) <- c("BI_RADS", "Age", "Shape", "Margin", "Density", "Severity")
#change all "?" to proper NA values
mam_mass[mam_mass == "?"] <- NA
Project 3
Introduction
Research Question: To what extent do age, mass shape, mass margin, and mass density predict breast mass severity?
About the Dataset: The “Mammographic Mass” dataset contains the results of patients who were screened for breast cancer. It has 961 observations and 6 variables, 5 of which I will use in my regression. The variables are described below:
BI-RADS assessment: (numerical, discrete) I will not be using this variable, as the source states it is non-predictive.
Age: patient’s age in years (numerical, discrete)
Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (numerical discrete, can be turned into categorical, nominal)
Density: mass density (high=1 iso=2 low=3 fat-containing=4), (numerical discrete, can be turned into categorical, nominal)
Severity: benign=0 or malignant=1 (numerical, binary)
Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (numerical discrete, can be turned into categorical, nominal)
The dataset comes from the UC Irvine Machine Learning Repository. Link: https://archive.ics.uci.edu/dataset/161/mammographic+mass.
Data Analysis
I will be performing a logistic regression, which suits a binary categorical outcome variable. I will use “Severity” as my outcome variable, and “Age”, “Shape”, “Density”, and “Margin” as the predictors. (All predictors other than Age will be converted to categorical variables.) I will create a box plot and a jittered scatterplot to preview the relationship between severity and its predictors: the box plot compares the median and spread of age by severity, and the scatterplot shows how tumor shape is distributed by severity.
Load Dataset
EDA
str(mam_mass)
'data.frame': 961 obs. of 6 variables:
$ BI_RADS : chr "5" "4" "5" "4" ...
$ Age : chr "67" "43" "58" "28" ...
$ Shape : chr "3" "1" "4" "1" ...
$ Margin : chr "5" "1" "5" "1" ...
$ Density : chr "3" NA "3" "3" ...
$ Severity: int 1 1 1 0 1 0 0 0 1 1 ...
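The chr columns above arise because the raw file marks missing values with "?". As a minimal sketch of an alternative, assuming the same comma-separated layout, read.csv() can convert that placeholder on import through its na.strings argument, so the numeric columns are parsed as integers directly. The two rows are copied from the head() preview of this report (the "?" in the second row is inferred from the NA shown there); raw_text and demo are illustrative names.

```r
# Two rows in the raw file's layout; "?" marks a missing value
raw_text <- "5,67,3,5,3,1\n4,43,1,1,?,1"

# na.strings = "?" converts the placeholder to NA while reading
demo <- read.csv(text = raw_text, header = FALSE, na.strings = "?")
colnames(demo) <- c("BI_RADS", "Age", "Shape", "Margin", "Density", "Severity")

str(demo)             # columns are parsed as integer, not chr
colSums(is.na(demo))  # missing values counted per column
```

With this approach the separate "?" replacement and the later as.integer() conversion are no longer needed.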
#change age to integer, change all other variable values to their categorical meanings. also remove BI_RADS because it is non-predictive
head(mam_mass)
  BI_RADS Age Shape Margin Density Severity
1       5  67     3      5       3        1
2       4  43     1      1    <NA>        1
3       5  58     4      5       3        1
4       4  28     1      1       3        0
5       5  74     1      5    <NA>        1
6       4  65     1   <NA>       3        0
tail(mam_mass)
    BI_RADS Age Shape Margin Density Severity
956       4  52     4      4       3        1
957       4  47     2      1       3        0
958       4  56     4      5       3        1
959       4  64     4      5       3        0
960       5  66     4      5       3        1
961       4  62     3      3       3        0
#there are noticeable NA values in Margin and Density when checking the head. Expand search for NAs
colSums(is.na(mam_mass))
 BI_RADS      Age    Shape   Margin  Density Severity
       2        5       31       48       76        0
#NA values in every column except Severity; filter them out when cleaning
Cleaning
#remove BI-RADS variable, will not be used in project
mam_mass <- mam_mass |>
  select(-BI_RADS)
#change Shape, Margin, and Density to their categorical values
#note: ordered = TRUE makes glm() use polynomial contrasts (.L, .Q, .C) for these factors
mam_mass$Shape <- factor(mam_mass$Shape, levels = c(1, 2, 3, 4), labels = c("round", "oval", "lobular", "irregular"), ordered = TRUE)
#note: Margin code 5 (spiculated) is not listed in levels, so those rows become NA and are dropped by the NA filter below
mam_mass$Margin <- factor(mam_mass$Margin, levels = c(1, 2, 3, 4), labels = c("circumscribed", "microlobulated", "obscured", "ill-defined"), ordered = TRUE)
mam_mass$Density <- factor(mam_mass$Density, levels = c(1, 2, 3, 4), labels = c("high", "iso", "low", "fat-containing"), ordered = TRUE)
#change age to integer
mam_mass <- mam_mass |>
  mutate(Age = as.integer(Age))
#remove all NAs
mam_mass <- mam_mass |>
  filter(!if_any(everything(), is.na))
#check to see if Age is an integer and other variable values have been successfully changed
str(mam_mass)
'data.frame': 704 obs. of 5 variables:
$ Age : int 28 76 42 36 60 54 52 59 54 56 ...
$ Shape : Ord.factor w/ 4 levels "round"<"oval"<..: 1 1 2 3 2 1 3 2 1 4 ...
$ Margin : Ord.factor w/ 4 levels "circumscribed"<..: 1 4 1 1 1 1 4 1 1 3 ...
$ Density : Ord.factor w/ 4 levels "high"<"iso"<"low"<..: 3 3 3 2 2 3 3 3 3 1 ...
$ Severity: int 0 1 1 0 0 0 0 1 1 1 ...
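How factor() treats codes that are absent from levels affects the row count after cleaning, which matters here because Margin is documented with five codes (1 to 5). A small sketch with made-up codes (margin_codes is illustrative, not taken from the dataset): any value not listed in levels becomes NA, so a later NA filter removes that row.

```r
# Made-up margin codes, stored as character like the raw columns
margin_codes <- c("1", "4", "5", "2", "5")

# Only codes 1-4 listed: every "5" is unmatched and becomes NA
four_levels <- factor(margin_codes, levels = c(1, 2, 3, 4),
                      labels = c("circumscribed", "microlobulated",
                                 "obscured", "ill-defined"))

# All five documented codes listed: nothing is lost
five_levels <- factor(margin_codes, levels = c(1, 2, 3, 4, 5),
                      labels = c("circumscribed", "microlobulated",
                                 "obscured", "ill-defined", "spiculated"))

sum(is.na(four_levels))  # number of codes turned into NA
sum(is.na(five_levels))  # no codes lost
```

Comparing the NA counts before and after a factor() conversion is a quick way to confirm that no category was dropped unintentionally.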
Visualizations
Boxplot
#box plot to show age median/spread differences across cancer-free and cancer-diagnosed patients
ggplot(mam_mass, aes(x = factor(Severity), y = Age, fill = factor(Severity))) +
  geom_boxplot() +
  scale_x_discrete(labels = c("0" = "No Cancer", "1" = "Has Cancer")) +
  theme_minimal() +
  labs(title = "Age by Breast Cancer Diagnosis",
       x = "Breast Cancer (no/yes)",
       y = "Age in Years",
       caption = "Source: UCI Machine Learning Repository"
  ) +
  theme(legend.position = "none")
#the median age of patients diagnosed with breast cancer is higher than that of patients without cancer. Additionally, the interquartile ranges do not overlap much, so Age is likely to be a strong predictor in the logistic regression model.
#jittered scatterplot to show distribution differences
ggplot(mam_mass, aes(x = factor(Severity), y = Shape, color = factor(Severity))) +
  geom_jitter(alpha = 0.5, width = 0.2) + # alpha makes points transparent to see overlaps
  scale_x_discrete(labels = c("No", "Yes")) +
  labs(
    title = "Severity Distribution by Tumor Shape",
    x = "Breast Cancer Diagnosis",
    y = "Tumor Shape",
    color = "Severity Level",
    caption = "Source: UCI Machine Learning Repository"
  ) +
  theme_minimal()
#round tumor shape is especially prevalent in cancer-free patients, and irregular tumor shape is prevalent in cancer-diagnosed patients. This suggests an association between shape and severity
Logistic Regression
#create final model
logistic <- glm(Severity ~ Age + Density + Margin + Shape, data = mam_mass, family = "binomial")
#calculate model summary
summary(logistic)
Call:
glm(formula = Severity ~ Age + Density + Margin + Shape, family = "binomial",
data = mam_mass)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.73674    0.59896  -6.239 4.41e-10 ***
Age          0.05640    0.00842   6.698 2.11e-11 ***
Density.L   -1.28144    0.75400  -1.700   0.0892 .
Density.Q   -0.20804    0.59051  -0.352   0.7246
Density.C   -0.72534    0.36667  -1.978   0.0479 *
Margin.L     0.78415    0.25377   3.090   0.0020 **
Margin.Q    -0.62131    0.31740  -1.957   0.0503 .
Margin.C     0.64355    0.38194   1.685   0.0920 .
Shape.L      1.28768    0.26999   4.769 1.85e-06 ***
Shape.Q      0.52693    0.22317   2.361   0.0182 *
Shape.C     -0.18446    0.24091  -0.766   0.4439
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 958.69 on 703 degrees of freedom
Residual deviance: 612.88 on 693 degrees of freedom
AIC: 634.88
Number of Fisher Scoring iterations: 5
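The .L, .Q, and .C suffixes in the coefficient table come from how R codes ordered factors. A minimal sketch, assuming R's default contrasts option and using only the four Shape labels from this report, compares the contrasts of an ordered and an unordered factor; shape_levels is an illustrative name.

```r
shape_levels <- c("round", "oval", "lobular", "irregular")

ordered_shape   <- factor(shape_levels, levels = shape_levels, ordered = TRUE)
unordered_shape <- factor(shape_levels, levels = shape_levels)

# Ordered factors default to polynomial contrasts:
# linear (.L), quadratic (.Q), and cubic (.C) trends across the levels
colnames(contrasts(ordered_shape))

# Unordered factors default to treatment contrasts:
# one column per level, each compared with the first level ("round")
colnames(contrasts(unordered_shape))
```

With unordered factors, each coefficient would compare one category with the reference category, which is usually easier to interpret for nominal variables such as shape.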
Intercept, Log-odds Interpretation
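The intercept is the estimated log-odds of breast cancer (-3.73674) when Age is zero and all contrast terms for Density, Shape, and Margin are zero. Since no patient has an age of zero, it serves as a baseline for the model rather than a prediction for a real patient.
Odds Ratio, Strongest Predictors
Age raises the odds of breast cancer: each additional year multiplies the odds by exp(0.0564), or about 1.06. Not every term raises the odds, because several coefficients are negative. Since Shape, Margin, and Density were entered as ordered factors, their .L, .Q, and .C rows are linear, quadratic, and cubic trends across the ordered levels rather than effects of individual categories. Age and the linear Shape trend (Shape.L) are the strongest predictors (both p < .001), followed by the linear Margin trend (Margin.L, p = .002); Shape.Q and Density.C are also significant at p < .05.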
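Coefficients on the log-odds scale are easier to read as odds ratios. A short sketch using the Age estimate copied by hand from the summary output (age_coef is an illustrative name); with the fitted model in memory, exp(coef(logistic)) would apply the same conversion to every term.

```r
# Age estimate copied from the model summary (log-odds per year)
age_coef <- 0.05640

# Odds ratio for a one-year increase in age
or_one_year <- exp(age_coef)

# Odds ratio for a ten-year increase: the log-odds change scales linearly
or_ten_years <- exp(10 * age_coef)

round(c(one_year = or_one_year, ten_years = or_ten_years), 3)
```

An odds ratio above 1 means the odds of malignancy rise with the predictor, and a ratio below 1 means they fall.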
Pseudo R-Squared
### r_square
r_square <- 1 - (logistic$deviance/logistic$null.deviance)
r_square
[1] 0.3607115
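The same value can be checked from the deviances printed in the model summary, and those numbers also support a likelihood-ratio test of the full model against the intercept-only model. A minimal sketch with the values copied by hand; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from the model summary
null_dev  <- 958.69; null_df  <- 703
resid_dev <- 612.88; resid_df <- 693

# Pseudo R-squared: proportional reduction in deviance
pseudo_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test: the drop in deviance is approximately
# chi-square distributed under the null hypothesis of no predictor effect
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(pseudo_r2, 3)
c(statistic = lr_stat, df = lr_df, p_value = lr_p)
```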
#This means the model reduces the deviance by about 36.1% relative to the null model, based on all predictors (age, shape, density, margin)
Confusion Matrix
#missing values were removed during cleaning, and Severity is already a numerical variable. no need to convert or filter for complete cases
#predicted probabilities
predicted.probs <- logistic$fitted.values
#predicted classes
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
#confusion matrix
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(mam_mass$Severity, levels = c(0, 1))
)
confusion
         Actual
Predicted   0   1
        0 335  63
        1  72 234
Confusion Interpretation
335 people had no breast cancer, and the model said they had no breast cancer. (true negative)
63 people had breast cancer, but the model said they had no breast cancer. (false negative)
72 people had no breast cancer, but the model said that they had breast cancer. (false positive)
234 people had breast cancer, and the model said that they had breast cancer. (true positive)
Performance Metrics
#Extract values (rows = predicted, columns = actual):
TN <- 335
FP <- 72 #predicted cancer, actually no cancer
FN <- 63 #predicted no cancer, actually cancer
TP <- 234
#Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
Accuracy: 0.808
Sensitivity: 0.788
Specificity: 0.823
Precision: 0.765
Model Performance Interpretation
The model has high accuracy (80.8%). It is slightly better at identifying true negatives (specificity, 82.3%) than true positives (sensitivity, 78.8%), a difference of 3.5 percentage points, so detecting positives and avoiding false alarms are reasonably balanced.
Overall, these are good results for a binary classification model on medical data, but with 63 cancers missed as false negatives, sensitivity should realistically be higher.
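Reading the four counts off the table by name avoids transcription slips between false positives and false negatives. A sketch that rebuilds the confusion matrix from this report as a plain matrix (rows are predictions, columns are actual classes) and derives the metrics from it; cm and metrics are illustrative names.

```r
# Confusion matrix from the report; matrix() fills column by column
cm <- matrix(c(335, 72, 63, 234), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Index as cm[predicted, actual]
TN <- cm["0", "0"]  # predicted 0, actual 0
FP <- cm["1", "0"]  # predicted 1, actual 0
FN <- cm["0", "1"]  # predicted 0, actual 1
TP <- cm["1", "1"]  # predicted 1, actual 1

metrics <- c(accuracy    = (TP + TN) / sum(cm),
             sensitivity = TP / (TP + FN),
             specificity = TN / (TN + FP),
             precision   = TP / (TP + FP))
round(metrics, 3)
```

With the fitted model in memory, the same indexing works on the confusion table directly, since its dimension names are also "0" and "1".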
ROC Curve and AUC Value
# ROC curve & AUC on full data
roc_obj <- roc(response = mam_mass$Severity,
               predictor = logistic$fitted.values,
               levels = c("0", "1"),
               direction = "<")
# Print AUC value
auc_val <- auc(roc_obj); auc_val
Area under the curve: 0.8759
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")
AUC and ROC Interpretation
The AUC of .876 means the model is very good at distinguishing between breast cancer positive and breast cancer negative patients.
On the plot, the curve is far above the diagonal “random guess” line, which shows the model is much better than chance.
In plain words: if you randomly pick one breast cancer positive and one breast cancer negative patient, the model has about an 87.6% chance of ranking the breast cancer positive one higher.
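That ranking interpretation can be computed directly. A sketch in base R with made-up scores (not from the dataset): the AUC equals the share of positive/negative pairs in which the positive case receives the higher predicted probability, with ties counted as half.

```r
# Made-up predicted probabilities for three positive and three negative cases
pos_scores <- c(0.9, 0.8, 0.4)
neg_scores <- c(0.7, 0.3, 0.2)

# Compare every positive score with every negative score
diffs <- outer(pos_scores, neg_scores, "-")

# Share of pairs ranked correctly; ties count as half
auc_by_ranking <- mean((diffs > 0) + 0.5 * (diffs == 0))
auc_by_ranking  # 8 of the 9 pairs are ranked correctly
```

Only the pair (0.4, 0.7) is ranked the wrong way round, which is why the toy AUC falls just short of 1.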
Conclusion
In conclusion, age and the linear trend across the ordered tumor shape categories (Shape.L) were the most significant predictors of breast cancer in the dataset. Using these results, the answer to my research question, “To what extent do age, mass shape, mass margin, and mass density predict breast mass severity?” is that age and tumor shape contribute the most, margin contributes moderately, and density the least. Additionally, the pseudo-R² (0.361) suggests the model fits considerably better than a model with no predictors. The confusion matrix showed that most people in the dataset were correctly classified as having or not having breast cancer, and the performance metrics showed 80.8% accuracy. More specifically, the model was slightly better at identifying benign masses than malignant ones. If I were to further my research, I would add more predictor variables like patient weight or family history of breast cancer.
References
Elter, M. (2019). UCI Machine Learning Repository. UC Irvine Machine Learning Repository; University of California. https://archive.ics.uci.edu/dataset/161/mammographic+mass