This report analyzes a breast cancer diagnostic dataset where each observation corresponds to a tumor sample and the goal is to classify the tumor as benign (B) or malignant (M) based on quantitative measurements of cell nuclei.
Research questions
The two research questions addressed in this report are: (1) which cell-nuclei measurements differ most between malignant and benign tumors, and (2) how accurately the diagnosis (B vs M) can be predicted from those measurements.
The input file for this project is the uploaded dataset breast-cancer.csv.xls (despite the .xls extension, it is a comma-separated text file).
pkgs <- c("readr", "dplyr", "tidyr", "ggplot2", "stringr", "purrr", "tibble")
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")
# Modeling + ROC
if (!requireNamespace("e1071", quietly = TRUE)) install.packages("e1071", repos = "https://cloud.r-project.org")
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC", repos = "https://cloud.r-project.org")
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(purrr)
library(tibble)
library(e1071)
library(pROC)
raw_path <- "breast-cancer.csv.xls"
stopifnot(file.exists(raw_path))
raw <- readr::read_csv(raw_path, show_col_types = FALSE)
dim(raw)
## [1] 569 32
names(raw)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
The following checks are done strictly in code: the required outcome column (diagnosis) is present, the ID column (id, if present) contains no duplicates, missing values are counted in every column, completely empty placeholder columns are flagged, and the outcome takes only the expected values.
# Required outcome column
stopifnot("diagnosis" %in% names(raw))
# ID checks
if ("id" %in% names(raw)) {
n_dups <- sum(duplicated(raw$id))
n_dups
}
## [1] 0
# Missingness by column
na_by_col <- sapply(raw, function(x) sum(is.na(x)))
na_by_col[order(na_by_col, decreasing = TRUE)][1:min(10, length(na_by_col))]
## id diagnosis radius_mean texture_mean
## 0 0 0 0
## perimeter_mean area_mean smoothness_mean compactness_mean
## 0 0 0 0
## concavity_mean concave points_mean
## 0 0
# Completely empty columns (common placeholder in some copies of this dataset)
all_na_cols <- names(na_by_col)[na_by_col == nrow(raw)]
all_na_cols
## character(0)
# Check that outcome values are expected
diag_vals <- sort(unique(as.character(raw$diagnosis)))
diag_vals
## [1] "B" "M"
What this step does: This block confirms that the
dataset has the outcome column (diagnosis), checks whether
an ID column contains duplicates, and counts missing values in every
column. If the file has a placeholder column that is completely empty,
it is identified here so it can be removed during cleaning.
Cleaning steps: drop any fully empty columns, move the ID into a separate .id column, trim and convert diagnosis to a factor with levels B and M, coerce all predictor columns to numeric, and keep only distinct rows with complete, finite predictor values.
df <- raw
# 1) Drop fully empty columns
if (length(all_na_cols) > 0) df <- df %>% select(-all_of(all_na_cols))
# Keep ID separately if present
if ("id" %in% names(df)) {
df <- df %>% mutate(.id = id) %>% select(-id)
} else {
df <- df %>% mutate(.id = row_number())
}
# 2) Outcome cleaning
df <- df %>%
mutate(diagnosis = as.character(diagnosis)) %>%
mutate(diagnosis = str_trim(diagnosis)) %>%
mutate(diagnosis = factor(diagnosis, levels = c("B", "M")))
if (any(is.na(df$diagnosis))) {
bad_rows <- which(is.na(df$diagnosis))[1:min(10, sum(is.na(df$diagnosis)))]
print(df[bad_rows, c(".id", "diagnosis")])
stop("Diagnosis contains unexpected values; see printed rows.")
}
# 3) Identify predictor columns
predictor_cols <- setdiff(names(df), c(".id", "diagnosis"))
# 4) Coerce predictors to numeric
df <- df %>% mutate(across(all_of(predictor_cols), ~ suppressWarnings(as.numeric(.x))))
# Quantify NAs after coercion (should be 0 for this dataset)
na_after <- sapply(df[, predictor_cols, drop = FALSE], function(x) sum(is.na(x)))
summary(na_after)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
# 5) Keep complete cases and finite values
df <- df %>%
filter(if_all(all_of(predictor_cols), ~ !is.na(.x) & is.finite(.x))) %>%
distinct(.id, .keep_all = TRUE)
dim(df)
## [1] 569 32
table(df$diagnosis)
##
## B M
## 357 212
What this step does: This block removes empty
columns, standardizes the outcome into two categories (B and M),
converts all predictor columns to numeric, and filters to complete,
finite rows. After this, df is the single cleaned table
used for every plot and model, which keeps the analysis consistent and
reproducible.
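For reference, the structure of the cleaned table can be inspected directly; this is a quick optional check, not part of the analysis itself (output omitted).
dplyr::glimpse(df)       # column types for .id, diagnosis, and the 30 numeric predictors
summary(df$diagnosis)    # counts of the B and M factor levels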
ggplot(df, aes(x = diagnosis)) +
geom_bar() +
labs(title = "Class balance", x = "Diagnosis", y = "Count")
What this graph shows: This bar chart shows how many benign vs malignant samples are in the dataset. If the classes are imbalanced, accuracy can be misleading, so later I also report ROC/AUC to summarize performance across thresholds.
Comment: The dataset has more benign than malignant cases, which is important to keep in mind when interpreting accuracy.
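As a numeric companion to the bar chart, the class shares can be tabulated directly (a small sketch; class_counts is an illustrative name, not an object used elsewhere in the analysis).
class_counts <- table(df$diagnosis)
round(prop.table(class_counts), 3)   # proportion of benign vs malignant samples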
selected <- c("radius_mean", "texture_mean", "perimeter_mean", "area_mean",
"compactness_mean", "concavity_mean")
selected <- selected[selected %in% predictor_cols]
eda_long <- df %>%
select(diagnosis, all_of(selected)) %>%
pivot_longer(cols = all_of(selected), names_to = "feature", values_to = "value")
ggplot(eda_long, aes(x = value)) +
geom_histogram(bins = 40) +
facet_wrap(~feature, scales = "free") +
labs(title = "Distributions of selected features", x = "Value", y = "Count")
ggplot(eda_long, aes(x = diagnosis, y = value)) +
geom_boxplot(outlier.alpha = 0.2) +
facet_wrap(~feature, scales = "free_y") +
labs(title = "Selected features by diagnosis", x = "Diagnosis", y = "Value")
What these graphs show: The histograms summarize the distribution of each selected feature across all samples, which helps check scale and outliers. The boxplots compare benign vs malignant directly. When the malignant boxes are shifted higher or lower than benign, that feature is a strong candidate predictor and should also appear near the top of the statistical ranking later.
Comment: Several size-related features show clear shifts between malignant and benign groups, suggesting that a classifier should have useful signal.
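To put numbers on the shifts visible in the boxplots, a short sketch computes the median of each selected feature by diagnosis from the same eda_long table.
eda_long %>%
  group_by(feature, diagnosis) %>%
  summarise(median_value = median(value), .groups = "drop") %>%
  pivot_wider(names_from = diagnosis, values_from = median_value)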
set.seed(320)
corr_feats <- predictor_cols
if (length(corr_feats) > 15) corr_feats <- sample(corr_feats, 15)
C <- cor(as.matrix(df[, corr_feats, drop = FALSE]))
C_long <- as.data.frame(as.table(C)) %>% rename(f1 = Var1, f2 = Var2, r = Freq)
ggplot(C_long, aes(x = f1, y = f2, fill = r)) +
geom_tile() +
labs(title = "Correlation heatmap (random subset of features)", x = "", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
What this graph shows: This heatmap summarizes correlations among a subset of features. Many features move together because they describe related geometric properties. High correlation means some predictors carry overlapping information, which is why models like logistic regression and linear SVM are good baselines here.
Comment: Many predictors are correlated, which is typical for measurements derived from related geometric properties.
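To make the overlap concrete, the following sketch lists the most strongly correlated pairs within the plotted subset; the 0.9 cutoff is an arbitrary illustrative choice.
C_long %>%
  filter(as.character(f1) < as.character(f2), abs(r) > 0.9) %>%
  arrange(desc(abs(r)))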
To address the first research question, each feature is compared between malignant and benign groups using a two-sample t-test. P-values are adjusted with BH/FDR to account for multiple testing.
tt <- purrr::map_dfr(predictor_cols, function(f) {
x <- df[[f]]
p <- t.test(x ~ df$diagnosis)$p.value
tibble(
feature = f,
mean_B = mean(x[df$diagnosis == "B"]),
mean_M = mean(x[df$diagnosis == "M"]),
diff_mean = mean_M - mean_B,
p_value = p
)
}) %>%
mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
arrange(p_adj)
tt %>% slice(1:15)
## # A tibble: 15 × 6
## feature mean_B mean_M diff_mean p_value p_adj
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 concave points_worst 0.0744 0.182 0.108 1.06e-96 3.18e-95
## 2 perimeter_worst 87.0 141. 54.4 1.03e-72 1.55e-71
## 3 concave points_mean 0.0257 0.0880 0.0623 3.13e-71 2.67e-70
## 4 radius_worst 13.4 21.1 7.76 3.56e-71 2.67e-70
## 5 perimeter_mean 78.1 115. 37.3 1.02e-66 6.14e-66
## 6 radius_mean 12.1 17.5 5.32 1.68e-64 8.42e-64
## 7 concavity_worst 0.166 0.451 0.284 9.85e-59 4.22e-58
## 8 concavity_mean 0.0461 0.161 0.115 3.74e-58 1.40e-57
## 9 area_worst 559. 1422. 863. 4.94e-54 1.65e-53
## 10 area_mean 463. 978. 516. 3.28e-52 9.85e-52
## 11 compactness_mean 0.0801 0.145 0.0651 9.61e-42 2.62e-41
## 12 compactness_worst 0.183 0.375 0.192 1.75e-38 4.37e-38
## 13 radius_se 0.284 0.609 0.325 1.49e-30 3.44e-30
## 14 texture_worst 23.5 29.3 5.80 5.20e-30 1.11e-29
## 15 perimeter_se 2.00 4.32 2.32 6.87e-29 1.37e-28
What this step does: Here I run a two-sample t-test for each feature to compare malignant vs benign means. Because many tests are run, I adjust p-values using the BH/FDR method. The top rows of the resulting table highlight which measurements differ most strongly between the two groups.
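As a direct summary for the first research question, a short follow-up counts how many features remain significant after adjustment; the 0.05 FDR cutoff is a conventional choice rather than anything dictated by the data.
sum(tt$p_adj < 0.05)                              # number of features significant at FDR 0.05
tt %>% filter(p_adj >= 0.05) %>% pull(feature)    # features that do not clear the cutoff, if any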
tt2 <- tt %>% mutate(neglog10p = -log10(p_value))
ggplot(tt2, aes(x = diff_mean, y = neglog10p)) +
geom_point(alpha = 0.7) +
labs(
title = "Feature differences (malignant - benign) vs significance",
x = "Difference in means",
y = "-log10(p-value)"
)
What this graph shows: Each point is a feature. The x-axis is the mean difference (malignant minus benign), so points farther from zero have larger effects. The y-axis is statistical evidence (smaller p-values appear higher). Features that are both far from zero and high on the plot are the most clearly separated between the two diagnoses.
Comment: Features with large positive mean differences indicate measurements that tend to be larger for malignant cases in this dataset.
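If the most extreme points need names, a variant of the plot labels the features with the smallest adjusted p-values; plain geom_text is used to avoid extra dependencies, so labels may overlap (top_feats is an illustrative object).
top_feats <- tt2 %>% arrange(p_adj) %>% slice(1:8)
ggplot(tt2, aes(x = diff_mean, y = neglog10p)) +
  geom_point(alpha = 0.7) +
  geom_text(data = top_feats, aes(label = feature), vjust = -0.6, size = 3) +
  labs(title = "Feature differences vs significance (top features labelled)",
       x = "Difference in means", y = "-log10(p-value)")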
This section answers the second research question with two models: logistic regression fit on all predictors and a linear SVM fit on standardized predictors. The data are split 80/20 into training and test sets, and the split is stratified so that both classes appear in each set.
set.seed(320)
train <- df %>%
group_by(diagnosis) %>%
slice_sample(prop = 0.8) %>%
ungroup()
test <- df %>% anti_join(train %>% select(.id), by = ".id")
# Ensure both classes appear
table(train$diagnosis)
##
## B M
## 285 169
table(test$diagnosis)
##
## B M
## 72 43
X_train <- as.matrix(train[, predictor_cols, drop = FALSE])
X_test <- as.matrix(test[, predictor_cols, drop = FALSE])
y_train <- train$diagnosis
y_test <- test$diagnosis
# Standardize for SVM using training statistics
mu <- colMeans(X_train)
sdv <- apply(X_train, 2, sd)
sdv[sdv == 0] <- 1
X_train_sc <- scale(X_train, center = mu, scale = sdv)
X_test_sc <- scale(X_test, center = mu, scale = sdv)
# Numeric label for ROC with logistic regression
test$y01 <- if_else(test$diagnosis == "M", 1L, 0L)
What this step does: This creates a stratified 80/20 train-test split so that both benign and malignant cases appear in each set. I also standardize predictors (using training means and standard deviations) for the SVM model, because SVM performance depends on features being on comparable scales.
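A quick sanity check, separate from the modeling itself, confirms that the standardized training predictors have mean roughly 0 and standard deviation roughly 1, while the test set is only approximately centered because the training statistics were reused.
summary(colMeans(X_train_sc))       # should all be essentially 0
summary(apply(X_train_sc, 2, sd))   # should all be essentially 1
summary(colMeans(X_test_sc))        # near 0 but not exact, by construction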
logit_fit <- glm(diagnosis ~ ., data = train %>% select(diagnosis, all_of(predictor_cols)), family = binomial())
logit_prob <- predict(logit_fit, newdata = test, type = "response")
logit_pred <- if_else(logit_prob >= 0.5, "M", "B") %>% factor(levels = c("B","M"))
logit_conf <- table(truth = y_test, pred = logit_pred)
logit_conf
## pred
## truth B M
## B 67 5
## M 6 37
logit_acc <- sum(diag(logit_conf)) / sum(logit_conf)
logit_acc
## [1] 0.9043478
# ROC/AUC (requires both classes in test set)
if (length(unique(test$y01)) == 2) {
logit_roc <- pROC::roc(response = test$y01, predictor = logit_prob, levels = c(0, 1), quiet = TRUE)
logit_auc <- as.numeric(pROC::auc(logit_roc))
logit_auc
} else {
logit_auc <- NA_real_
message("ROC/AUC not computed because the test set contains only one class.")
}
## [1] 0.9114987
What this step does and what to look for: Logistic regression estimates the probability a tumor is malignant based on all predictors. The confusion matrix shows correct vs incorrect classifications at a 0.5 threshold, and accuracy summarizes overall correctness. ROC/AUC summarizes how well the probabilities separate the two classes across all thresholds; higher AUC means better separation.
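Because the classes are imbalanced, it also helps to split accuracy into sensitivity (recall for malignant) and specificity; a minimal sketch computed from the confusion matrix above, treating M as the positive class (logit_sens and logit_spec are illustrative names).
logit_sens <- logit_conf["M", "M"] / sum(logit_conf["M", ])   # true positive rate for malignant
logit_spec <- logit_conf["B", "B"] / sum(logit_conf["B", ])   # true negative rate for benign
c(sensitivity = logit_sens, specificity = logit_spec)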
svm_fit <- e1071::svm(
x = X_train_sc,
y = y_train,
kernel = "linear",
probability = TRUE
)
svm_pred <- predict(svm_fit, X_test_sc, probability = TRUE)
svm_prob <- attr(svm_pred, "probabilities")[, "M"]
svm_conf <- table(truth = y_test, pred = svm_pred)
svm_conf
## pred
## truth B M
## B 72 0
## M 2 41
svm_acc <- sum(diag(svm_conf)) / sum(svm_conf)
svm_acc
## [1] 0.9826087
svm_roc <- pROC::roc(response = y_test, predictor = svm_prob, levels = c("B","M"), quiet = TRUE)
svm_auc <- as.numeric(pROC::auc(svm_roc))
svm_auc
## [1] 0.994186
plot(svm_roc, main = sprintf("SVM ROC curve (AUC = %.3f)", svm_auc))
What this step does and what to look for: The linear SVM is trained on standardized predictors and produces class predictions and probabilities. I report the confusion matrix and accuracy, and then I plot the ROC curve and compute AUC. If SVM AUC is higher than logistic regression AUC, it indicates better overall separation between malignant and benign in the test set.
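To compare the two models side by side, a small summary table collects the accuracy and AUC values already computed above (a sketch, not an additional model run).
tibble(
  model    = c("Logistic regression", "Linear SVM"),
  accuracy = c(logit_acc, svm_acc),
  auc      = c(logit_auc, svm_auc)
)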
write_csv(df %>% select(.id, diagnosis, all_of(predictor_cols)), "breast_cancer_cleaned.csv")
"breast_cancer_cleaned.csv"
## [1] "breast_cancer_cleaned.csv"
What this step does: This writes out the cleaned dataset used for the analysis. Saving the cleaned table is useful for reproducibility and makes it easy to re-run models or create additional plots without repeating the cleaning steps.
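As a lightweight reproducibility check, the exported file can be read back and its dimensions compared with the cleaned table; column types are re-inferred on read, so diagnosis returns as character (check is an illustrative name).
check <- readr::read_csv("breast_cancer_cleaned.csv", show_col_types = FALSE)
dim(check)                          # expected to match dim(df): 569 rows, 32 columns
setdiff(names(check), names(df))    # character(0) if every exported column is present in df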
In this project, I compared malignant and benign tumors using quantitative cell-nuclei measurements and evaluated predictive models for the diagnosis.
Some key components and findings are:
After BH/FDR adjustment, many measurements differ significantly between malignant and benign samples, and size-related features (radius, perimeter, area) show the largest mean differences.
In my predictive modeling task, logistic regression achieved accuracy 0.904 on the test set (AUC 0.911). This is in comparison to a linear SVM, which achieved accuracy 0.983 (AUC 0.994).
Overall, these results indicate that the cell-nuclei measurements contain strong diagnostic signal for separating malignant from benign tumors, at least within this dataset.
Breast cancer dataset source: Kaggle — “Breast Cancer Dataset” by yasserh. https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset