Breast Cancer Prediction in R

Author

Richmond Silvanus Baye

Published

April 7, 2025

Predicting Breast Cancer with Machine Learning

A step-by-step walkthrough using multiple models and evaluation techniques. Predicting breast cancer using machine learning is not just about building a model and hitting “run” — it’s about understanding your data, choosing the right algorithms, and making sense of the results.

In this blog-style post, we’ll walk through the process we used to predict breast cancer using various machine learning classifiers. We’ll explore the challenges, compare models, and reflect on what works best. Think of it as a guided story rather than a purely technical manual.

This project focused on data pre-processing, model evaluation, visualization, and interpretation.

Alright, let’s begin by loading the various packages for the analysis.

# Load the necessary packages for the analysis
pacman::p_load(randomForest, gbm, adabag, e1071, class, caret, readr, rpart, rpart.plot, kknn, dplyr, kernlab, BiocManager, pROC, pdp, rattle, inTrees, mlr, lightgbm)

The Challenge

We worked with the Breast Cancer Wisconsin dataset, a classic yet still highly relevant dataset for binary classification. The goal? To determine whether a tumor is benign or malignant based on various medical attributes. Although this may sound simple, it begs the following questions.

  • What happens when two different features contribute opposing signals?

  • How do we choose between simple, interpretable models and complex, less interpretable but higher-performing ones?

These are the types of questions we tackle in this project.

# Set URL
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"

# Load the dataset
data <- read_csv(url, col_names = FALSE)

# View first few rows
head(data)
# A tibble: 6 × 11
       X1    X2    X3    X4    X5    X6 X7       X8    X9   X10   X11
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1000025     5     1     1     1     2 1         3     1     1     2
2 1002945     5     4     4     5     7 10        3     2     1     2
3 1015425     3     1     1     1     2 2         3     1     1     2
4 1016277     6     8     8     1     3 4         3     7     1     2
5 1017023     4     1     1     3     2 1         3     1     1     2
6 1017122     8    10    10     8     7 10        9     7     1     4

Data Preparation

We pulled the data directly from the UCI Machine Learning Repository and handled a few important cleaning steps:

Defining proper names for the respective columns in the dataset.

colnames(data) <- c(
  "Sample_code_number",
  "Clump_Thickness",
  "Uniformity_of_Cell_Size",
  "Uniformity_of_Cell_Shape",
  "Marginal_Adhesion",
  "Single_Epithelial_Cell_Size",
  "Bare_Nuclei",
  "Bland_Chromatin",
  "Normal_Nucleoli",
  "Mitoses",
  "Class"
)
data
# A tibble: 699 × 11
   Sample_code_number Clump_Thickness Uniformity_of_Cell_Size
                <dbl>           <dbl>                   <dbl>
 1            1000025               5                       1
 2            1002945               5                       4
 3            1015425               3                       1
 4            1016277               6                       8
 5            1017023               4                       1
 6            1017122               8                      10
 7            1018099               1                       1
 8            1018561               2                       1
 9            1033078               2                       1
10            1033078               4                       2
# ℹ 689 more rows
# ℹ 8 more variables: Uniformity_of_Cell_Shape <dbl>, Marginal_Adhesion <dbl>,
#   Single_Epithelial_Cell_Size <dbl>, Bare_Nuclei <chr>,
#   Bland_Chromatin <dbl>, Normal_Nucleoli <dbl>, Mitoses <dbl>, Class <dbl>

We deal with missing values by first replacing “?” with NA in the Bare_Nuclei column.

We proceed with converting the variable Bare_Nuclei to numeric before removing all rows with NAs from the dataset.

# Replace "?" with NA
data$Bare_Nuclei[data$Bare_Nuclei == "?"] <- NA

# Convert Bare_Nuclei to numeric
data$Bare_Nuclei <- as.numeric(data$Bare_Nuclei)

# Remove rows with NA
data <- na.omit(data)

The target variable is coded numerically (2 and 4), so we need to convert it to a factor with appropriate labels (Benign vs. Malignant).

data$Class <- factor(data$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))

Now, let’s double-check that the data is clean, i.e., that it contains no missing values or obvious outliers.

#Double Check for missing values
colSums(is.na(data))
         Sample_code_number             Clump_Thickness 
                          0                           0 
    Uniformity_of_Cell_Size    Uniformity_of_Cell_Shape 
                          0                           0 
          Marginal_Adhesion Single_Epithelial_Cell_Size 
                          0                           0 
                Bare_Nuclei             Bland_Chromatin 
                          0                           0 
            Normal_Nucleoli                     Mitoses 
                          0                           0 
                      Class 
                          0 
# As you can see, each column sum is 0, indicating that there are no missing values in the data

Having checked for missing values, we also have to be cognizant of potential duplicates in the dataset. To do that, we count the rows whose measurement profile already appears earlier in the data (ignoring the ID column). The function returned 234 duplicated entries.

#Checking for duplicates
sum(duplicated(data[, -1])) 
[1] 234
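These duplicated rows are kept for the rest of the analysis, since identical measurement profiles can legitimately come from different samples. If we wanted to take a closer look at them, here is a quick aside (illustrative only, not part of the original cleaning steps).

# Flag every row whose measurement profile (ID column excluded) appears more than once
dup_idx <- duplicated(data[, -1]) | duplicated(data[, -1], fromLast = TRUE)

# How the repeated profiles split across the two classes
table(data$Class[dup_idx])

# data[!duplicated(data[, -1]), ] would keep only the first occurrence of each profile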

Descriptive Statistics

As we can see, there are 683 observations in the dataset, with about 65% of the cases benign and about 35% malignant.

#Summary statistics for numeric columns 
summary(data)
 Sample_code_number Clump_Thickness  Uniformity_of_Cell_Size
 Min.   :   63375   Min.   : 1.000   Min.   : 1.000         
 1st Qu.:  877617   1st Qu.: 2.000   1st Qu.: 1.000         
 Median : 1171795   Median : 4.000   Median : 1.000         
 Mean   : 1076720   Mean   : 4.442   Mean   : 3.151         
 3rd Qu.: 1238705   3rd Qu.: 6.000   3rd Qu.: 5.000         
 Max.   :13454352   Max.   :10.000   Max.   :10.000         
 Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size
 Min.   : 1.000           Min.   : 1.00     Min.   : 1.000             
 1st Qu.: 1.000           1st Qu.: 1.00     1st Qu.: 2.000             
 Median : 1.000           Median : 1.00     Median : 2.000             
 Mean   : 3.215           Mean   : 2.83     Mean   : 3.234             
 3rd Qu.: 5.000           3rd Qu.: 4.00     3rd Qu.: 4.000             
 Max.   :10.000           Max.   :10.00     Max.   :10.000             
  Bare_Nuclei     Bland_Chromatin  Normal_Nucleoli    Mitoses      
 Min.   : 1.000   Min.   : 1.000   Min.   : 1.00   Min.   : 1.000  
 1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.000  
 Median : 1.000   Median : 3.000   Median : 1.00   Median : 1.000  
 Mean   : 3.545   Mean   : 3.445   Mean   : 2.87   Mean   : 1.603  
 3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00   3rd Qu.: 1.000  
 Max.   :10.000   Max.   :10.000   Max.   :10.00   Max.   :10.000  
       Class    
 Benign   :444  
 Malignant:239  
                
                
                
                
table(data$Class)

   Benign Malignant 
      444       239 
prop.table(table(data$Class))

   Benign Malignant 
0.6500732 0.3499268 

Now that our data is clean - with no missing values or obvious outliers - and we’ve confirmed that the target variable is categorical (Benign vs. Malignant), we’re ready to move into the modeling phase. But before we dive in, there’s one small but important detail: setting a random seed. This ensures that our results are reproducible every time we run the code.

Next, we split the dataset into two parts: 70% for training the models and 30% for testing how well they perform on unseen data. This split helps us evaluate the model’s ability to generalize to new cases, which is critical in a real-world medical setting.

set.seed(123)

trainingIndex <- createDataPartition(data$Class, p = 0.7, list = FALSE)
trainData <- data[trainingIndex, -1] #this will exclude the ID variable
testData <- data[-trainingIndex, -1]

With our training and testing sets in place, the next step is to start building our models. But before jumping into that, it’s important to introduce one more safeguard: cross-validation.

Cross-validation allows us to assess how well our models are likely to perform on new, unseen data by testing them across multiple folds of the training set. This helps reduce the risk of overfitting and gives us a more reliable estimate of performance before we even touch the test set.

set.seed(123)

#Define a task
task <- makeClassifTask(data = trainData, target = "Class")

#Define learners 
logistic <- makeLearner("classif.logreg", predict.type = "prob")
knn <- makeLearner("classif.kknn", predict.type = "prob")
tree <- makeLearner("classif.rpart", predict.type = "prob")
svm <- makeLearner("classif.svm", predict.type = "prob", kernel = "radial")
rf <- makeLearner("classif.randomForest", predict.type = "prob")

# Gradient boosting via mlr's "classif.gbm" (the gbm package, not LightGBM despite the variable name)
lgbm <- makeLearner("classif.gbm", predict.type = "prob")

To make it easier to compare the performance of all our models side by side, we’ll store each one in a single list. This way, we can loop through them for evaluation, plotting, and performance tracking without repeating code or creating unnecessary clutter.

Now we can implement all the models. We do not need to tune hyper-parameters at this point; this run is solely for benchmarking. After selecting the best model, we can then tune its hyper-parameters to improve performance.
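For reference, here is a minimal sketch of what that later tuning step could look like with mlr's tuneParams(), using the gradient boosting learner as an example. The grid values below are illustrative assumptions, not settings used in this analysis.

# Illustrative only: grid-search a few gbm hyper-parameters with 5-fold CV
ps <- makeParamSet(
  makeDiscreteParam("n.trees", values = c(100, 300, 500)),
  makeDiscreteParam("interaction.depth", values = c(1, 2, 3)),
  makeDiscreteParam("shrinkage", values = c(0.01, 0.05, 0.1))
)

tuned <- tuneParams(
  makeLearner("classif.gbm", predict.type = "prob"),
  task = task,
  resampling = makeResampleDesc("CV", iters = 5, stratify = TRUE),
  measures = auc,
  par.set = ps,
  control = makeTuneControlGrid()
)

tuned$x  # best hyper-parameter combination; pass back to makeLearner() via par.vals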

The Models We Explored

As mentioned earlier, we didn’t want to rely on a single algorithm. Instead, we tried a diverse set of models:

  • Logistic Regression: a classic baseline model, simple and interpretable

  • K-Nearest Neighbors: a non-parametric model based on distance

  • Decision Tree: a rule-based model that splits on feature thresholds

  • Support Vector Machine: a margin-based classifier using an RBF kernel

  • Random Forest: an ensemble of decision trees, robust and powerful

  • Gradient Boosting (GBM): a boosting ensemble of trees, optimized for speed and performance

Each of these brings strengths and weaknesses to the table. So how do they compare?

#list all the learners

learners <- list(logistic, knn, tree, svm, rf, lgbm)

names(learners) <- c("Logistic", "KNN", "Tree", "SVM", "RF", "GBM")
#Define cross validation 
cv <- makeResampleDesc("CV", iters = 5, stratify = TRUE)

#Performance measures
measure <- auc

#benchmark all models 
bmr <- benchmark(learners, task, cv, measure)
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
plotBMRBoxplots(bmr)

Key Metrics

# Apply best model to test data
test_task <- makeClassifTask(data = testData, target = "Class")
lgbm_model <- train(lgbm, task)
Distribution not specified, assuming bernoulli ...
lgbm_pred <- predict(lgbm_model, test_task)
performance(lgbm_pred, measure)
      auc 
0.9922694 
plotBMRSummary(bmr, pretty.names = TRUE)
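Since this section is about key metrics, it is worth looking beyond AUC. The short sketch below is an illustrative addition; its printed numbers depend on this particular split and are not quoted from the post. It cross-tabulates the GBM test-set predictions and reports accuracy together with the per-class rates. Note that mlr computes tpr/tnr with respect to the task's positive class, which defaults to the first factor level (Benign).

# Confusion matrix for the GBM predictions on the held-out test set
calculateConfusionMatrix(lgbm_pred)

# Accuracy, plus true positive / true negative rates relative to the positive class
performance(lgbm_pred, measures = list(acc, tpr, tnr))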

# Convert benchmark result to threshold-vs-performance data
perf_data <- generateThreshVsPerfData(bmr, measures = list(fpr, tpr))

# Now plot ROC curves
plotROCCurves(
  perf_data,
  diagonal = TRUE,
  pretty.names = TRUE,
  facet.learner = FALSE
)

The ROC curve above compares the performance of all six classification models by plotting their true positive rate (sensitivity) against their false positive rate. Each colored line represents a different learner: logistic regression, k-nearest neighbors (KNN), decision tree (rpart), support vector machine (SVM), random forest, and gradient boosting (GBM). We can see that most models perform exceptionally well, with curves hugging the top-left corner — which indicates a high true positive rate and a low false positive rate. This is exactly what we want in a classification task.

Notably, the gradient boosting model (classif.gbm) and random forest (classif.randomForest) show nearly perfect separation between the classes, closely followed by logistic regression (classif.logreg). The decision tree (classif.rpart), while still performing well, shows a slightly lower curve, suggesting it’s more prone to false positives or false negatives compared to the ensemble models.

getBMRAggrPerformances(bmr, as.df = TRUE)
    task.id           learner.id auc.test.mean
1 trainData       classif.logreg     0.9941754
2 trainData         classif.kknn     0.9868864
3 trainData        classif.rpart     0.9537249
4 trainData          classif.svm     0.9906273
5 trainData classif.randomForest     0.9922474
6 trainData          classif.gbm     0.9907797

The Precision-Recall (PR) curve below provides another lens through which to evaluate our models, particularly in the context of class imbalance. Unlike the ROC curve, which focuses on true vs. false positive rates, this plot emphasizes the trade-off between precision (positive predictive value) and recall (true positive rate). In medical diagnostics, this is crucial — we want to correctly identify as many malignant cases as possible, while minimizing false alarms.

From the plot, we see that most models, including GBM (classif.gbm), random forest, and logistic regression, maintain high precision across a wide range of recall values — indicating strong performance. The SVM model (classif.svm) notably hugs the top of the plot, suggesting exceptionally high precision, especially at higher recall levels. Meanwhile, KNN (classif.kknn) shows more gradual growth, indicating it’s less precise across the board, particularly in the mid-range of recall.

pr_data <- generateThreshVsPerfData(bmr, measures = list(tpr, ppv))

plotROCCurves(pr_data, pretty.names = TRUE, facet.learner = FALSE)

Now let’s focus only on the decision tree to understand how the cases (malignant and benign) were predicted.

tree_model <- train(learners$Tree, task)
rpart.plot(tree_model$learner.model, 
           type = 3,           # show variable names and splits
           extra = 101,        # show probabilities and percentages
           fallen.leaves = FALSE,
           main = "Decision Tree - rpart")

Explanation

The decision tree above helps us classify tumors as either benign or malignant based on a sequence of simple yes/no questions about cell characteristics. It begins by asking whether the Uniformity of Cell Size is less than 3.5. If the answer is no (i.e., the cell size is larger and more consistent), the model confidently predicts the tumor as malignant, since the majority of such cases fall into that category. If the answer is yes (cell size is smaller or more varied), the model digs deeper—asking additional questions about Bare Nuclei, Uniformity of Cell Shape, and Clump Thickness to refine the prediction.

For example, when Bare Nuclei is less than 2.5, the model predicts benign with high accuracy. But if it’s higher, it looks at other features to decide. Each path through the tree ends in a leaf node that shows the predicted class, the number of benign vs. malignant cases, and what percentage of the dataset falls into that category. This structure makes the model highly interpretable and provides insights into which features are most influential in determining cancer risk. It’s a great example of how even a simple model can uncover meaningful patterns in medical data.