This report analyzes a breast cancer diagnostic dataset where each observation corresponds to a tumor sample and the goal is to classify the tumor as benign (B) or malignant (M) based on quantitative measurements of cell nuclei.
Research questions
The two research questions addressed in this report are: (1) which cell-nuclei measurements differ most between malignant and benign tumors, and (2) how accurately the diagnosis (B vs M) can be predicted from those measurements.
The input file for this project is the uploaded dataset breast-cancer.csv.xls (despite the .xls extension, it is a comma-separated text file).
pkgs <- c("readr", "dplyr", "tidyr", "ggplot2", "stringr", "purrr", "tibble")
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")
# Modeling + ROC
if (!requireNamespace("e1071", quietly = TRUE)) install.packages("e1071", repos = "https://cloud.r-project.org")
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC", repos = "https://cloud.r-project.org")
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(purrr)
library(tibble)
library(e1071)
library(pROC)
raw_path <- "breast-cancer.csv.xls"
stopifnot(file.exists(raw_path))
raw <- readr::read_csv(raw_path, show_col_types = FALSE)
dim(raw)
## [1] 569 32
names(raw)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
The following checks are done strictly in code: the required outcome column (diagnosis) is present, the ID column (id, if present) contains no duplicates, missing values are counted in every column, completely empty placeholder columns are flagged, and the outcome takes only the expected values.
# Required outcome column
stopifnot("diagnosis" %in% names(raw))
# ID checks
if ("id" %in% names(raw)) {
n_dups <- sum(duplicated(raw$id))
n_dups
}
## [1] 0
# Missingness by column
na_by_col <- sapply(raw, function(x) sum(is.na(x)))
na_by_col[order(na_by_col, decreasing = TRUE)][1:min(10, length(na_by_col))]
## id diagnosis radius_mean texture_mean
## 0 0 0 0
## perimeter_mean area_mean smoothness_mean compactness_mean
## 0 0 0 0
## concavity_mean concave points_mean
## 0 0
# Completely empty columns (common placeholder in some copies of this dataset)
all_na_cols <- names(na_by_col)[na_by_col == nrow(raw)]
all_na_cols
## character(0)
# Check that outcome values are expected
diag_vals <- sort(unique(as.character(raw$diagnosis)))
diag_vals
## [1] "B" "M"
What this step does: This block confirms that the
dataset has the outcome column (diagnosis), checks whether
an ID column contains duplicates, and counts missing values in every
column. If the file has a placeholder column that is completely empty,
it is identified here so it can be removed during cleaning.
Cleaning steps: drop any fully empty columns, move the ID into a separate .id column, trim and convert diagnosis to a factor with levels B and M, coerce all predictor columns to numeric, and keep only distinct rows with complete, finite predictor values.
df <- raw
# 1) Drop fully empty columns
if (length(all_na_cols) > 0) df <- df %>% select(-all_of(all_na_cols))
# Keep ID separately if present
if ("id" %in% names(df)) {
df <- df %>% mutate(.id = id) %>% select(-id)
} else {
df <- df %>% mutate(.id = row_number())
}
# 2) Outcome cleaning
df <- df %>%
mutate(diagnosis = as.character(diagnosis)) %>%
mutate(diagnosis = str_trim(diagnosis)) %>%
mutate(diagnosis = factor(diagnosis, levels = c("B", "M")))
if (any(is.na(df$diagnosis))) {
bad_rows <- which(is.na(df$diagnosis))[1:min(10, sum(is.na(df$diagnosis)))]
print(df[bad_rows, c(".id", "diagnosis")])
stop("Diagnosis contains unexpected values; see printed rows.")
}
# 3) Identify predictor columns
predictor_cols <- setdiff(names(df), c(".id", "diagnosis"))
# 4) Coerce predictors to numeric
df <- df %>% mutate(across(all_of(predictor_cols), ~ suppressWarnings(as.numeric(.x))))
# Quantify NAs after coercion (should be 0 for this dataset)
na_after <- sapply(df[, predictor_cols, drop = FALSE], function(x) sum(is.na(x)))
summary(na_after)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
# 5) Keep complete cases and finite values
df <- df %>%
filter(if_all(all_of(predictor_cols), ~ !is.na(.x) & is.finite(.x))) %>%
distinct(.id, .keep_all = TRUE)
dim(df)
## [1] 569 32
table(df$diagnosis)
##
## B M
## 357 212
What this step does: This block removes empty
columns, standardizes the outcome into two categories (B and M),
converts all predictor columns to numeric, and filters to complete,
finite rows. After this, df is the single cleaned table
used for every plot and model, which keeps the analysis consistent and
reproducible.
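For reference, the structure of the cleaned table can be inspected directly; this is a quick optional check, not part of the analysis itself (output omitted).
dplyr::glimpse(df)       # column types for .id, diagnosis, and the 30 numeric predictors
summary(df$diagnosis)    # counts of the B and M factor levels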
ggplot(df, aes(x = diagnosis)) +
geom_bar() +
labs(title = "Class balance", x = "Diagnosis", y = "Count")
What this graph shows: This bar chart shows how many benign vs malignant samples are in the dataset. If the classes are imbalanced, accuracy can be misleading, so later I also report ROC/AUC to summarize performance across thresholds.
Comment: The dataset has more benign than malignant cases, which is important to keep in mind when interpreting accuracy.
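As a numeric companion to the bar chart, the class shares can be tabulated directly (a small sketch; class_counts is an illustrative name, not an object used elsewhere in the analysis).
class_counts <- table(df$diagnosis)
round(prop.table(class_counts), 3)   # proportion of benign vs malignant samples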
selected <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"compactness_mean", "concavity_mean")
selected <- selected[selected %in% predictor_cols]
eda_long <- df %>%
select(diagnosis, all_of(selected)) %>%
pivot_longer(cols = all_of(selected), names_to = "feature", values_to = "value")
ggplot(eda_long, aes(x = value)) +
geom_histogram(bins = 40) +
facet_wrap(~feature, scales = "free") +
labs(title = "Distributions of selected features", x = "Value", y = "Count")
ggplot(eda_long, aes(x = diagnosis, y = value)) +
geom_boxplot(outlier.alpha = 0.2) +
facet_wrap(~feature, scales = "free_y") +
labs(title = "Selected features by diagnosis", x = "Diagnosis", y = "Value")
What these graphs show: The histograms summarize the distribution of each selected feature across all samples, which helps check scale and outliers. The boxplots compare benign vs malignant directly. When the malignant boxes are shifted higher or lower than benign, that feature is a strong candidate predictor and should also appear near the top of the statistical ranking later.
Comment: Several size-related features show clear shifts between malignant and benign groups, suggesting that a classifier should have useful signal.
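To put numbers on the shifts visible in the boxplots, a short sketch computes the median of each selected feature by diagnosis from the same eda_long table.
eda_long %>%
  group_by(feature, diagnosis) %>%
  summarise(median_value = median(value), .groups = "drop") %>%
  pivot_wider(names_from = diagnosis, values_from = median_value)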
set.seed(320)
corr_feats <- predictor_cols
if (length(corr_feats) > 15) corr_feats <- sample(corr_feats, 15)
C <- cor(as.matrix(df[, corr_feats, drop = FALSE]))
C_long <- as.data.frame(as.table(C)) %>% rename(f1 = Var1, f2 = Var2, r = Freq)
ggplot(C_long, aes(x = f1, y = f2, fill = r)) +
geom_tile() +
labs(title = "Correlation heatmap (random subset of features)", x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
What this graph shows: This heatmap summarizes correlations among a subset of features. Many features move together because they describe related geometric properties. High correlation means some predictors carry overlapping information, which is why models like logistic regression and linear SVM are good baselines here.
Comment: Many predictors are correlated, which is typical for measurements derived from related geometric properties.
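To make the overlap concrete, the following sketch lists the most strongly correlated pairs within the plotted subset; the 0.9 cutoff is an arbitrary illustrative choice.
C_long %>%
  filter(as.character(f1) < as.character(f2), abs(r) > 0.9) %>%
  arrange(desc(abs(r)))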
To address the first research question, each feature is compared between malignant and benign groups using a two-sample t-test. P-values are adjusted with BH/FDR to account for multiple testing.
tt <- purrr::map_dfr(predictor_cols, function(f) {
x <- df[[f]]
p <- t.test(x ~ df$diagnosis)$p.value
tibble(
feature = f,
mean_B = mean(x[df$diagnosis == "B"]),
mean_M = mean(x[df$diagnosis == "M"]),
diff_mean = mean_M - mean_B,
p_value = p
)
}) %>%
mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
arrange(p_adj)
tt %>% slice(1:15)
## # A tibble: 15 × 6
## feature mean_B mean_M diff_mean p_value p_adj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 concave points_worst 0.0744 0.182 0.108 1.06e-96 3.18e-95
## 2 perimeter_worst 87.0 141. 54.4 1.03e-72 1.55e-71
## 3 concave points_mean 0.0257 0.0880 0.0623 3.13e-71 2.67e-70
## 4 radius_worst 13.4 21.1 7.76 3.56e-71 2.67e-70
## 5 perimeter_mean 78.1 115. 37.3 1.02e-66 6.14e-66
## 6 radius_mean 12.1 17.5 5.32 1.68e-64 8.42e-64
## 7 concavity_worst 0.166 0.451 0.284 9.85e-59 4.22e-58
## 8 concavity_mean 0.0461 0.161 0.115 3.74e-58 1.40e-57
## 9 area_worst 559. 1422. 863. 4.94e-54 1.65e-53
## 10 area_mean 463. 978. 516. 3.28e-52 9.85e-52
## 11 compactness_mean 0.0801 0.145 0.0651 9.61e-42 2.62e-41
## 12 compactness_worst 0.183 0.375 0.192 1.75e-38 4.37e-38
## 13 radius_se 0.284 0.609 0.325 1.49e-30 3.44e-30
## 14 texture_worst 23.5 29.3 5.80 5.20e-30 1.11e-29
## 15 perimeter_se 2.00 4.32 2.32 6.87e-29 1.37e-28
What this step does: Here I run a two-sample t-test for each feature to compare malignant vs benign means. Because many tests are run, I adjust p-values using the BH/FDR method. The top rows of the resulting table highlight which measurements differ most strongly between the two groups.
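As a direct summary for the first research question, a short follow-up counts how many features remain significant after adjustment; the 0.05 FDR cutoff is a conventional choice rather than anything dictated by the data.
sum(tt$p_adj < 0.05)                              # number of features significant at FDR 0.05
tt %>% filter(p_adj >= 0.05) %>% pull(feature)    # features that do not clear the cutoff, if any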
tt2 <- tt %>% mutate(neglog10p = -log10(p_value))
ggplot(tt2, aes(x = diff_mean, y = neglog10p)) +
geom_point(alpha = 0.7) +
labs(
title = "Feature differences (malignant - benign) vs significance",
x = "Difference in means",
y = "-log10(p-value)"
)
What this graph shows: Each point is a feature. The x-axis is the mean difference (malignant minus benign), so points farther from zero have larger effects. The y-axis is statistical evidence (smaller p-values appear higher). Features that are both far from zero and high on the plot are the most clearly separated between the two diagnoses.
Comment: Features with large positive mean differences indicate measurements that tend to be larger for malignant cases in this dataset.
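If the most extreme points need names, a variant of the plot labels the features with the smallest adjusted p-values; plain geom_text is used to avoid extra dependencies, so labels may overlap (top_feats is an illustrative object).
top_feats <- tt2 %>% arrange(p_adj) %>% slice(1:8)
ggplot(tt2, aes(x = diff_mean, y = neglog10p)) +
  geom_point(alpha = 0.7) +
  geom_text(data = top_feats, aes(label = feature), vjust = -0.6, size = 3) +
  labs(title = "Feature differences vs significance (top features labelled)",
       x = "Difference in means", y = "-log10(p-value)")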
This section answers the second research question with two models: logistic regression fit on all predictors and a linear SVM fit on standardized predictors. The data are split 80/20 into training and test sets, and the split is stratified so that both classes appear in each set.
set.seed(320)
train <- df %>%
group_by(diagnosis) %>%
slice_sample(prop = 0.8) %>%
ungroup()
test <- df %>% anti_join(train %>% select(.id), by = ".id")
# Ensure both classes appear
table(train$diagnosis)
##
## B M
## 285 169
table(test$diagnosis)
##
## B M
## 72 43
X_train <- as.matrix(train[, predictor_cols, drop = FALSE])
X_test <- as.matrix(test[, predictor_cols, drop = FALSE])
y_train <- train$diagnosis
y_test <- test$diagnosis
# Standardize for SVM using training statistics
mu <- colMeans(X_train)
sdv <- apply(X_train, 2, sd)
sdv[sdv == 0] <- 1
X_train_sc <- scale(X_train, center = mu, scale = sdv)
X_test_sc <- scale(X_test, center = mu, scale = sdv)
# Numeric label for ROC with logistic regression
test$y01 <- if_else(test$diagnosis == "M", 1L, 0L)
What this step does: This creates a stratified 80/20 train-test split so that both benign and malignant cases appear in each set. I also standardize predictors (using training means and standard deviations) for the SVM model, because SVM performance depends on features being on comparable scales.
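A quick sanity check, separate from the modeling itself, confirms that the standardized training predictors have mean roughly 0 and standard deviation roughly 1, while the test set is only approximately centered because the training statistics were reused.
summary(colMeans(X_train_sc))       # should all be essentially 0
summary(apply(X_train_sc, 2, sd))   # should all be essentially 1
summary(colMeans(X_test_sc))        # near 0 but not exact, by construction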
logit_fit <- glm(diagnosis ~ ., data = train %>% select(diagnosis, all_of(predictor_cols)), family = binomial())
logit_prob <- predict(logit_fit, newdata = test, type = "response")
logit_pred <- if_else(logit_prob >= 0.5, "M", "B") %>% factor(levels = c("B","M"))
logit_conf <- table(truth = y_test, pred = logit_pred)
logit_conf
## pred
## truth B M
## B 67 5
## M 6 37
logit_acc <- sum(diag(logit_conf)) / sum(logit_conf)
logit_acc
## [1] 0.9043478
# ROC/AUC (requires both classes in test set)
if (length(unique(test$y01)) == 2) {
logit_roc <- pROC::roc(response = test$y01, predictor = logit_prob, levels = c(0, 1), quiet = TRUE)
logit_auc <- as.numeric(pROC::auc(logit_roc))
logit_auc
} else {
logit_auc <- NA_real_
message("ROC/AUC not computed because the test set contains only one class.")
}
## [1] 0.9114987
What this step does and what to look for: Logistic regression estimates the probability a tumor is malignant based on all predictors. The confusion matrix shows correct vs incorrect classifications at a 0.5 threshold, and accuracy summarizes overall correctness. ROC/AUC summarizes how well the probabilities separate the two classes across all thresholds; higher AUC means better separation.
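Because the classes are imbalanced, it also helps to split accuracy into sensitivity (recall for malignant) and specificity; a minimal sketch computed from the confusion matrix above, treating M as the positive class (logit_sens and logit_spec are illustrative names).
logit_sens <- logit_conf["M", "M"] / sum(logit_conf["M", ])   # true positive rate for malignant
logit_spec <- logit_conf["B", "B"] / sum(logit_conf["B", ])   # true negative rate for benign
c(sensitivity = logit_sens, specificity = logit_spec)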
svm_fit <- e1071::svm(
x = X_train_sc,
y = y_train,
kernel = "linear",
probability = TRUE
)
svm_pred <- predict(svm_fit, X_test_sc, probability = TRUE)
svm_prob <- attr(svm_pred, "probabilities")[, "M"]
svm_conf <- table(truth = y_test, pred = svm_pred)
svm_conf
## pred
## truth B M
## B 72 0
## M 2 41
svm_acc <- sum(diag(svm_conf)) / sum(svm_conf)
svm_acc
## [1] 0.9826087
svm_roc <- pROC::roc(response = y_test, predictor = svm_prob, levels = c("B","M"), quiet = TRUE)
svm_auc <- as.numeric(pROC::auc(svm_roc))
svm_auc
## [1] 0.994186
plot(svm_roc, main = sprintf("SVM ROC curve (AUC = %.3f)", svm_auc))
What this step does and what to look for: The linear SVM is trained on standardized predictors and produces class predictions and probabilities. I report the confusion matrix and accuracy, and then I plot the ROC curve and compute AUC. If SVM AUC is higher than logistic regression AUC, it indicates better overall separation between malignant and benign in the test set.
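To compare the two models side by side, a small summary table collects the accuracy and AUC values already computed above (a sketch, not an additional model run).
tibble(
  model    = c("Logistic regression", "Linear SVM"),
  accuracy = c(logit_acc, svm_acc),
  auc      = c(logit_auc, svm_auc)
)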
write_csv(df %>% select(.id, diagnosis, all_of(predictor_cols)), "breast_cancer_cleaned.csv")
"breast_cancer_cleaned.csv"
## [1] "breast_cancer_cleaned.csv"
What this step does: This writes out the cleaned dataset used for the analysis. Saving the cleaned table is useful for reproducibility and makes it easy to re-run models or create additional plots without repeating the cleaning steps.
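As a lightweight reproducibility check, the exported file can be read back and its dimensions compared with the cleaned table; column types are re-inferred on read, so diagnosis returns as character (check is an illustrative name).
check <- readr::read_csv("breast_cancer_cleaned.csv", show_col_types = FALSE)
dim(check)                          # expected to match dim(df): 569 rows, 32 columns
setdiff(names(check), names(df))    # character(0) if every exported column is present in df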
In this project, I compared malignant and benign tumors using quantitative cell-nuclei measurements and evaluated predictive models for the diagnosis.
Some key components and findings are:
After BH/FDR adjustment, many measurements differ significantly between malignant and benign samples, and size-related features (radius, perimeter, area) show the largest mean differences.
In my predictive modeling task, logistic regression achieved accuracy 0.904 on the test set (AUC 0.911). This is in comparison to a linear SVM, which achieved accuracy 0.983 (AUC 0.994).
Overall, these results indicate that the cell-nuclei measurements contain strong diagnostic signal for separating malignant from benign tumors, at least within this dataset.
Breast cancer dataset source: Kaggle — “Breast Cancer Dataset” by yasserh. https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset