Dataset name: Breast Cancer Wisconsin (Diagnostic).
Release date: 1993
Authors: W. Street, W. Wolberg, O. Mangasarian
Taken from: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
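# Packages used throughout the report
library(dplyr)       # select(), %>%
library(tidyr)       # pivot_longer()
library(ggplot2)     # histograms, heatmap, PCA scatter plot
library(car)         # vif(), leveneTest()
library(pROC)        # roc(), auc()
library(rpart)       # decision tree
library(rpart.plot)  # tree plot
library(caret)       # varImp()

# wdbc.data has no header row; column names are assigned below
df <- read.csv("wdbc.data", header = FALSE)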
colnames(df) <- c(
"id", "diagnosis",
"radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"smoothness_mean", "compactness_mean", "concavity_mean",
"concave.points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se",
"smoothness_se", "compactness_se", "concavity_se",
"concave.points_se", "symmetry_se", "fractal_dimension_se",
"radius_worst", "texture_worst", "perimeter_worst", "area_worst",
"smoothness_worst", "compactness_worst", "concavity_worst",
"concave.points_worst", "symmetry_worst", "fractal_dimension_worst"
)
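Clean the dataset and convert diagnosis into a factor. Note that the select() call below moves id to the last column rather than dropping it, which is why the dimensions reported later still show 32 columns; id is never used as a predictor.
# A negative selection followed by everything() re-adds the column at the end,
# so id is moved to the last position, not removed
df <- df %>%
select(-id, -matches("^X$"), everything())
# "B" (benign) is the reference level, so models estimate P(diagnosis = "M")
df$diagnosis <- factor(df$diagnosis, levels = c("B", "M"))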
table(df$diagnosis)
##
## B M
## 357 212
Histograms of selected key variables.
numeric_cols <- df %>%
select(where(is.numeric)) %>%
colnames()
df %>%
select(radius_mean, texture_mean, perimeter_mean, area_mean,
smoothness_mean, compactness_mean) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 25) +
facet_wrap(~ variable, scales = "free") +
theme_minimal()
Correlations
mean_vars <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"smoothness_mean", "compactness_mean", "concavity_mean",
"concave.points_mean", "symmetry_mean", "fractal_dimension_mean")
corr_mat <- cor(df[, mean_vars])
corr_df <- as.data.frame(as.table(corr_mat))
ggplot(corr_df, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
scale_fill_gradient2() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "", y = "", fill = "Correlation",
title = "Correlation matrix of mean features")
Use logistic regression to model the probability that a tumor is malignant (M) based on its morphological characteristics.
For the parametric analysis, I use logistic regression because the outcome of interest (the diagnosis) is binary: each tumor is either benign (B) or malignant (M). Logistic regression models the log-odds of malignancy (the logit) as a linear combination of the predictors and is the standard parametric method for binary classification problems. In this context, the model lets me estimate how changes in morphological characteristics are associated with a higher or lower probability of a tumor being malignant.
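For reference, the model described above can be written as follows. The symbols are notation introduced here for illustration, not taken from the original analysis: $p_i$ is the probability that tumor $i$ is malignant and $x_{i1}, \dots, x_{ik}$ are its predictors.

```latex
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}
```

Each coefficient $\beta_j$ is the change in log-odds per one-unit increase in $x_{ij}$ with the other predictors held fixed, so $e^{\beta_j}$ is the corresponding odds ratio.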
iqr_outliers <- function(x, k = 1.5) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower <- q1 - k * iqr
upper <- q3 + k * iqr
(x < lower) | (x > upper)
}
out_radius <- iqr_outliers(df$radius_mean)
out_area <- iqr_outliers(df$area_mean)
sum(out_radius | out_area)
## [1] 25
df_clean <- df[!(out_radius | out_area), ]
dim(df)
## [1] 569 32
dim(df_clean)
## [1] 544 32
Applying the interquartile range (IQR) rule to radius_mean and area_mean, I identified 25 observations (approximately 4.4% of the dataset) as potential outliers. These cases correspond to tumors with an unusually large radius or area compared to the rest of the sample. To reduce the influence of extreme values on the parameter estimates, I removed these 25 observations and performed the subsequent analyses on the cleaned dataset (df_clean), which contains 544 tumors and 32 columns.
shapiro.test(df_clean$radius_mean)
##
## Shapiro-Wilk normality test
##
## data: df_clean$radius_mean
## W = 0.96064, p-value = 6.97e-11
car::leveneTest(radius_mean ~ diagnosis, data = df_clean)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 49.373 6.382e-12 ***
## 542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Shapiro-Wilk test for radius_mean rejects normality (W = 0.961, p = 6.97e-11), indicating that this predictor does not follow a normal distribution. Furthermore, Levene’s test is highly significant (F = 49.4, p = 6.38e-12), showing that the variance of radius_mean differs between the benign and malignant groups.
However, in logistic regression, predictors are not required to be normally distributed or to have equal variances across outcome categories. What is important is that the logit model is well-specified and that there is no perfect separation. Therefore, these deviations from normality and homoscedasticity are not critical, although they should be considered when interpreting the results.
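To illustrate why perfect separation matters, here is a minimal sketch on hypothetical toy data (not the WDBC data) in which a single predictor splits the classes exactly. Under that assumption, glm() is expected to raise warnings and return very large, unstable estimates; the names x, y, and fit_warnings are made up for this example.

```r
# Toy data (hypothetical): x perfectly separates the two classes at x = 3.5
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Collect the warnings raised while fitting instead of printing them
fit_warnings <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

fit_warnings        # warnings emitted by glm.fit during fitting
coef(summary(fit))  # estimates and standard errors are very large
```

If the same fit on real training data produces no such warnings and the standard errors stay moderate, that is a reasonable (though not conclusive) sign that separation is not a problem.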
features_logit <- mean_vars
logit_data <- df_clean %>%
select(diagnosis, all_of(features_logit))
lm_aux <- lm(as.numeric(diagnosis) - 1 ~ ., data = logit_data)
vif(lm_aux)
## radius_mean texture_mean perimeter_mean
## 1367.833052 1.189109 1538.956093
## area_mean smoothness_mean compactness_mean
## 85.126357 3.034723 21.239903
## concavity_mean concave.points_mean symmetry_mean
## 9.670277 17.415204 1.754517
## fractal_dimension_mean
## 6.671479
The VIF values show very strong multicollinearity among several of the mean features, especially radius_mean, perimeter_mean, and area_mean, whose VIFs are far above the usual threshold of 10 (compactness_mean and concave.points_mean also exceed it). This makes sense because radius, perimeter, and area all describe tumor size. The VIF depends only on the predictors, so the auxiliary linear model used to compute it is adequate even though the outcome is binary. Although high multicollinearity does not necessarily worsen predictive ability, it can destabilize the coefficients and prevent a clear interpretation of each individual effect.
In this report, I retain all mean features in the logistic regression and focus on overall predictive performance, interpreting the coefficients cautiously. Additionally, I used a decision tree and PCA to summarize and visualize the main patterns without relying on the predictors being linearly independent.
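As a reminder of what the VIF measures, the sketch below computes it by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. It uses simulated data (the variables radius, perimeter, and texture here are hypothetical stand-ins, not the WDBC columns), so the numbers only illustrate the mechanism.

```r
# Simulated predictors (hypothetical): perimeter is almost a linear function of radius
set.seed(1)
n <- 200
radius    <- rnorm(n, mean = 14, sd = 3)
perimeter <- 2 * pi * radius + rnorm(n, sd = 0.5)
texture   <- rnorm(n, mean = 19, sd = 4)
X <- data.frame(radius, perimeter, texture)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(manual_vif, 2)
```

In this simulation the two size-like variables receive very large VIFs while the unrelated one stays close to 1, which mirrors the pattern seen for the size measures in the report.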
set.seed(123)
n <- nrow(logit_data)
train_idx <- sample(seq_len(n), size = 0.8 * n)
train_logit <- logit_data[train_idx, ]
test_logit <- logit_data[-train_idx, ]
prop.table(table(train_logit$diagnosis))
##
## B M
## 0.6413793 0.3586207
prop.table(table(test_logit$diagnosis))
##
## B M
## 0.7155963 0.2844037
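The class proportions above differ between the training set (about 36% malignant) and the test set (about 28%) because the split is a simple random sample. One possible alternative, sketched here in base R under assumed class counts (350 benign and 200 malignant are hypothetical numbers, and stratified_split is a helper defined only for this example), is to sample 80% within each class so both sets keep the same balance.

```r
# Draw a fraction p of the indices within each class of y
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  }), use.names = FALSE)
}

set.seed(123)
# Hypothetical labels, for illustration only
y <- factor(rep(c("B", "M"), times = c(350, 200)), levels = c("B", "M"))
train_idx <- stratified_split(y, p = 0.8)

p_train <- prop.table(table(y[train_idx]))
p_test  <- prop.table(table(y[-train_idx]))
rbind(train = p_train, test = p_test)
```

With a stratified split, differences in test-set performance are less likely to be driven by an unrepresentative class mix.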
logit_fit <- glm(diagnosis ~ ., data = train_logit, family = binomial)
summary(logit_fit)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_logit)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.89195 13.85452 -1.364 0.1727
## radius_mean 3.17289 3.97726 0.798 0.4250
## texture_mean 0.35747 0.06924 5.163 2.43e-07 ***
## perimeter_mean -0.80022 0.53899 -1.485 0.1376
## area_mean 0.03600 0.01600 2.251 0.0244 *
## smoothness_mean 76.03380 34.71683 2.190 0.0285 *
## compactness_mean -26.28354 22.02873 -1.193 0.2328
## concavity_mean 33.69279 13.18361 2.556 0.0106 *
## concave.points_mean 69.05215 34.24559 2.016 0.0438 *
## symmetry_mean 21.68266 11.43845 1.896 0.0580 .
## fractal_dimension_mean 48.76549 101.02003 0.483 0.6293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 567.78 on 434 degrees of freedom
## Residual deviance: 122.60 on 424 degrees of freedom
## AIC: 144.6
##
## Number of Fisher Scoring iterations: 8
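The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios. As a small worked example, the sketch below does this for two of the printed estimates (texture_mean and area_mean), with an approximate 95% Wald interval; the vector names est and se are introduced here for illustration.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(texture_mean = 0.35747, area_mean = 0.03600)
se  <- c(texture_mean = 0.06924, area_mean = 0.01600)

odds_ratio <- exp(est)                      # multiplicative change in odds per unit increase
ci_lower   <- exp(est - qnorm(0.975) * se)  # approximate 95% Wald interval, lower bound
ci_upper   <- exp(est + qnorm(0.975) * se)  # approximate 95% Wald interval, upper bound

round(cbind(odds_ratio, ci_lower, ci_upper), 3)
```

For instance, exp(0.357) is roughly 1.43, so each additional unit of texture_mean multiplies the estimated odds of malignancy by about 1.43, holding the other predictors fixed.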
Predicted probabilities of being malignant (M), classified with a threshold of 0.5:
prob_test <- predict(logit_fit, newdata = test_logit, type = "response")
pred_test <- ifelse(prob_test > 0.5, "M", "B")
pred_test <- factor(pred_test, levels = levels(test_logit$diagnosis))
conf_mat <- table(Actual = test_logit$diagnosis, Predicted = pred_test)
conf_mat
## Predicted
## Actual B M
## B 73 5
## M 1 30
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
## [1] 0.9449541
roc_obj <- roc(test_logit$diagnosis, prob_test, levels = c("B", "M"))
## Setting direction: controls < cases
auc(roc_obj)
## Area under the curve: 0.983
pseudo_R2 <- 1 - logit_fit$deviance / logit_fit$null.deviance
pseudo_R2
## [1] 0.7840678
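The logistic regression model performed very well, correctly classifying approximately 94.5% of the test-set tumors (103 of 109). The confusion matrix shows 5 false positives and 1 false negative, which corresponds to a sensitivity of about 96.8% (30/31) and a specificity of about 93.6% (73/78). The AUC of 0.983 indicates excellent ability to distinguish between benign and malignant tumors, and McFadden’s pseudo-R² of about 0.78, computed on the training data, indicates good overall fit. Among the predictors, texture_mean, area_mean, smoothness_mean, concavity_mean, and concave.points_mean have positive, statistically significant coefficients (p < 0.05), so higher values are associated with a higher probability of malignancy, consistent with clinical intuition regarding larger and more irregular tumors. Given the multicollinearity noted earlier, the individual coefficients should still be read cautiously.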
tree_data <- df_clean %>%
select(diagnosis, all_of(features_logit))
set.seed(123)
n2 <- nrow(tree_data)
train_idx2 <- sample(seq_len(n2), size = 0.8 * n2)
train_tree <- tree_data[train_idx2, ]
test_tree <- tree_data[-train_idx2, ]
tree_fit <- rpart(diagnosis ~ ., data = train_tree,
method = "class",
control = rpart.control(maxdepth = 4, cp = 0.01))
rpart.plot(tree_fit, main = "Decision Tree for Breast Cancer Diagnosis")
tree_pred <- predict(tree_fit, newdata = test_tree, type = "class")
conf_mat_tree <- table(Actual = test_tree$diagnosis, Predicted = tree_pred)
conf_mat_tree
## Predicted
## Actual B M
## B 71 7
## M 4 27
tree_acc <- sum(diag(conf_mat_tree)) / sum(conf_mat_tree)
tree_acc
## [1] 0.8990826
The decision tree achieves an accuracy of approximately 89.9% on the test set (98 of 109), lower than the 94.5% of the logistic regression, but it offers highly interpretable rules. The confusion matrix shows that most malignant tumors are correctly classified, with 4 false negatives and 7 false positives (sensitivity about 87.1%, specificity about 91.0%). According to the variable importance scores shown below, the most important variables are area_mean, concave.points_mean, radius_mean, perimeter_mean, and concavity_mean, all related to tumor size and irregularity. The initial divisions of the tree primarily use thresholds for area_mean and concavity measurements, reinforcing the conclusion that larger tumors with more pronounced concavities are more likely to be malignant.
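The error rates behind both accuracy figures can be derived directly from the two test-set confusion matrices reported above. The following sketch re-enters those counts by hand; class_metrics and make_cm are helper names invented for this example.

```r
# Accuracy, sensitivity, and specificity from a 2x2 table (rows = actual, cols = predicted)
class_metrics <- function(cm, positive = "M", negative = "B") {
  tp <- cm[positive, positive]; fn <- cm[positive, negative]
  tn <- cm[negative, negative]; fp <- cm[negative, positive]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),   # share of malignant tumors detected
    specificity = tn / (tn + fp))   # share of benign tumors correctly identified
}

# Build a labelled confusion matrix from counts given row by row
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("B", "M"), Predicted = c("B", "M")))
}

cm_logit <- make_cm(c(73, 5, 1, 30))  # logistic regression, counts from the report
cm_tree  <- make_cm(c(71, 7, 4, 27))  # decision tree, counts from the report

round(rbind(logit = class_metrics(cm_logit), tree = class_metrics(cm_tree)), 3)
```

In a diagnostic setting, sensitivity is usually the figure to watch, since a false negative (a missed malignant tumor) tends to be more costly than a false positive.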
varImp(tree_fit)
## Overall
## area_mean 138.52217
## concave.points_mean 135.97780
## concavity_mean 115.32931
## perimeter_mean 130.47532
## radius_mean 133.95028
## texture_mean 10.02694
## smoothness_mean 0.00000
## compactness_mean 0.00000
## symmetry_mean 0.00000
## fractal_dimension_mean 0.00000
pca_data <- df_clean[, mean_vars]
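# Standardize the features; prcomp() also centers and scales with
# center = TRUE and scale. = TRUE, so the second standardization is a no-op
pca_scaled <- scale(pca_data)
pca_fit <- prcomp(pca_scaled, center = TRUE, scale. = TRUE)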
summary(pca_fit)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2931 1.6343 0.94370 0.71790 0.62974 0.37333 0.28044
## Proportion of Variance 0.5258 0.2671 0.08906 0.05154 0.03966 0.01394 0.00786
## Cumulative Proportion 0.5258 0.7929 0.88197 0.93351 0.97316 0.98710 0.99497
## PC8 PC9 PC10
## Standard deviation 0.20665 0.08531 0.01868
## Proportion of Variance 0.00427 0.00073 0.00003
## Cumulative Proportion 0.99924 0.99997 1.00000
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cum_var <- cumsum(var_explained)
pca_var_df <- data.frame(
PC = seq_along(var_explained),
VarExplained = var_explained,
CumVar = cum_var
)
pca_var_df
## PC VarExplained CumVar
## 1 1 5.258272e-01 0.5258272
## 2 2 2.670850e-01 0.7929122
## 3 3 8.905705e-02 0.8819692
## 4 4 5.153808e-02 0.9335073
## 5 5 3.965744e-02 0.9731647
## 6 6 1.393760e-02 0.9871023
## 7 7 7.864414e-03 0.9949668
## 8 8 4.270530e-03 0.9992373
## 9 9 7.278190e-04 0.9999651
## 10 10 3.489137e-05 1.0000000
plot(cum_var, type = "b",
xlab = "Number of principal components",
ylab = "Cumulative explained variance",
main = "PCA – Cumulative explained variance")
abline(h = 0.8, col = "red", lty = 2)
loadings <- pca_fit$rotation[, 1:2]
loadings
## PC1 PC2
## radius_mean -0.35631655 0.33780335
## texture_mean -0.15189037 0.14501832
## perimeter_mean -0.37186173 0.30611386
## area_mean -0.35814323 0.33093525
## smoothness_mean -0.22857339 -0.39246414
## compactness_mean -0.36939636 -0.26545809
## concavity_mean -0.39455302 -0.12644702
## concave.points_mean -0.42407619 -0.02405019
## symmetry_mean -0.22556081 -0.33880112
## fractal_dimension_mean -0.09127033 -0.55297847
pca_scores <- as.data.frame(pca_fit$x[, 1:2])
pca_scores$diagnosis <- df_clean$diagnosis
ggplot(pca_scores, aes(x = PC1, y = PC2, color = diagnosis)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Patients projected on first two principal components")
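The PCA shows that PC1 explains approximately 52.6% of the variance, the first two components together explain 79.3%, and the first three explain 88.2%. That is, the 10 variables are well summarized in low dimensions. PC1 loads most heavily on concave points, concavity, perimeter, compactness, area, and radius, so it primarily reflects tumor size and irregularity; because all of its loadings are negative (the sign of a component is arbitrary), larger and more irregular tumors receive lower PC1 scores. PC2 contrasts the size measures (radius, area, perimeter) with fractal dimension, smoothness, and symmetry, so it is more closely associated with border complexity, while texture contributes little to either component. When the data are projected onto these components, benign and malignant tumors are clearly separated, indicating that a few latent dimensions are sufficient to distinguish them.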
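To make the choice of the number of components explicit, the sketch below reads it off the cumulative proportions reported above for a given threshold, and then checks on simulated data that the component variances from prcomp() on standardized variables match the eigenvalues of the correlation matrix. The helper n_components and the matrix sim are assumptions introduced for this example.

```r
# Cumulative proportions copied from the PCA summary above
cum_var <- c(0.5258, 0.7929, 0.88197, 0.93351, 0.97316,
             0.98710, 0.99497, 0.99924, 0.99997, 1.00000)

# Smallest number of components whose cumulative variance reaches a threshold
n_components <- function(cum_var, threshold) which(cum_var >= threshold)[1]

n_comp_80 <- n_components(cum_var, 0.80)  # two components reach 79.3%, just short of 80%
n_comp_90 <- n_components(cum_var, 0.90)
c(n_comp_80, n_comp_90)

# Simulated data (illustration only): PCA on standardized variables is the
# eigendecomposition of the correlation matrix
set.seed(42)
sim <- matrix(rnorm(300), ncol = 3)
sim[, 2] <- sim[, 1] + 0.3 * sim[, 2]
pc <- prcomp(sim, center = TRUE, scale. = TRUE)
same_variances <- isTRUE(all.equal(pc$sdev^2, eigen(cor(sim))$values))
same_variances
```

With the reported values, a strict 80% cutoff requires three components because the first two stop at 79.3%; in practice two components may still be a reasonable choice for visualization.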
Both parametric and non-parametric methods offer a very consistent picture. Logistic regression demonstrates excellent performance (94.5% accuracy, AUC = 0.983, and a pseudo-R² of about 0.78) and identifies size and irregularity variables (area, concavity, and concave points), together with texture and smoothness, as significant predictors; radius and perimeter are not individually significant, most likely because of their collinearity with area. The decision tree, although less accurate (89.9%), generates easily interpretable rules based on the same size and concavity characteristics, and PCA shows that 2–3 principal components, dominated by size and concavity, capture most of the variability and effectively separate benign and malignant cases.
The main limitations are the strong multicollinearity among the size measures and the use of a single dataset and imaging protocol. Despite this, all approaches agree that the morphological characteristics of the masses, especially those related to their size and borders, are very powerful predictors of malignancy in this context.