Breast Cancer Prediction in R
Predicting Breast Cancer with Machine Learning
A step-by-step walkthrough using multiple models and evaluation techniques. Predicting breast cancer using machine learning is not just about building a model and hitting “run” — it’s about understanding your data, choosing the right algorithms, and making sense of the results.
In this blog-style post, we’ll walk through the process we used to predict breast cancer using various machine learning classifiers. We’ll explore the challenges, compare models, and reflect on what works best. Think of it as a guided story rather than a purely technical manual.
This project focused on data pre-processing, model evaluation, visualization, and interpretation.
Alright, let’s begin by loading the various packages for the analysis.
# Load necessary libraries
pacman::p_load(randomForest, gbm, adabag, e1071, class, caret, readr, rpart, rpart.plot, kknn, dplyr, kernlab, BiocManager, pROC, pdp, rattle, inTrees, mlr, lightgbm)
The Challenge
We worked with the Breast Cancer Wisconsin dataset, a classic yet still highly relevant dataset for binary classification. The goal? To determine whether a tumor is benign or malignant based on various medical attributes. Although this may sound simple, it raises the following questions:
What happens when two different features contribute opposing signals?
How do we choose between simple, interpretable models and complex, less interpretable but higher-performing ones?
These are the types of questions we tackle in this project.
# Set URL
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# Load the dataset
data <- read_csv(url, col_names = FALSE)
# View first few rows
head(data)
# A tibble: 6 × 11
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1000025 5 1 1 1 2 1 3 1 1 2
2 1002945 5 4 4 5 7 10 3 2 1 2
3 1015425 3 1 1 1 2 2 3 1 1 2
4 1016277 6 8 8 1 3 4 3 7 1 2
5 1017023 4 1 1 3 2 1 3 1 1 2
6 1017122 8 10 10 8 7 10 9 7 1 4
Data Preparation
We pulled the data directly from the UCI Machine Learning Repository and handled a few important cleaning steps:
Defining proper names for the respective columns in the dataset.
colnames(data) <- c(
"Sample_code_number",
"Clump_Thickness",
"Uniformity_of_Cell_Size",
"Uniformity_of_Cell_Shape",
"Marginal_Adhesion",
"Single_Epithelial_Cell_Size",
"Bare_Nuclei",
"Bland_Chromatin",
"Normal_Nucleoli",
"Mitoses",
"Class"
)
data
# A tibble: 699 × 11
Sample_code_number Clump_Thickness Uniformity_of_Cell_Size
<dbl> <dbl> <dbl>
1 1000025 5 1
2 1002945 5 4
3 1015425 3 1
4 1016277 6 8
5 1017023 4 1
6 1017122 8 10
7 1018099 1 1
8 1018561 2 1
9 1033078 2 1
10 1033078 4 2
# ℹ 689 more rows
# ℹ 8 more variables: Uniformity_of_Cell_Shape <dbl>, Marginal_Adhesion <dbl>,
# Single_Epithelial_Cell_Size <dbl>, Bare_Nuclei <chr>,
# Bland_Chromatin <dbl>, Normal_Nucleoli <dbl>, Mitoses <dbl>, Class <dbl>
Dealing with missing values by first replacing “?” with “NA” in the Bare_Nuclei column.
We proceed by converting the variable Bare_Nuclei to numeric before removing all rows with NAs from the dataset.
#Replace "?' with NA
data$Bare_Nuclei[data$Bare_Nuclei == "?"] <- NA
#convert Bare_Nuclei to numeric
data$Bare_Nuclei <- as.numeric(data$Bare_Nuclei)
#Remove rows with NA
data <- na.omit(data)
The target variable is coded numerically (2 for benign, 4 for malignant), so we need to convert it to a factor with appropriate labels (Benign vs. Malignant).
data$Class <- factor(data$Class, levels = c(2, 4), labels = c("Benign", "Malignant"))
Now, let’s double-check that the data is clean, i.e., that it contains no missing values or obvious outliers.
#Double Check for missing values
colSums(is.na(data))
 Sample_code_number Clump_Thickness
0 0
Uniformity_of_Cell_Size Uniformity_of_Cell_Shape
0 0
Marginal_Adhesion Single_Epithelial_Cell_Size
0 0
Bare_Nuclei Bland_Chromatin
0 0
Normal_Nucleoli Mitoses
0 0
Class
0
#As you can see, each column sum is 0, indicating that there are no missing values in the data
Having checked for missing values, we also have to be cognizant of potential duplicates in the dataset. To do that, we count the duplicated rows while ignoring the ID column. The function returned 234 duplicated entries.
#Checking for duplicates
sum(duplicated(data[, -1]))
[1] 234
Descriptive Statistics
As we can see, there are 683 observations in the dataset, with about 65% of the cases benign and about 35% malignant.
#Summary statistics for numeric columns
summary(data)
 Sample_code_number Clump_Thickness Uniformity_of_Cell_Size
Min. : 63375 Min. : 1.000 Min. : 1.000
1st Qu.: 877617 1st Qu.: 2.000 1st Qu.: 1.000
Median : 1171795 Median : 4.000 Median : 1.000
Mean : 1076720 Mean : 4.442 Mean : 3.151
3rd Qu.: 1238705 3rd Qu.: 6.000 3rd Qu.: 5.000
Max. :13454352 Max. :10.000 Max. :10.000
Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size
Min. : 1.000 Min. : 1.00 Min. : 1.000
1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.: 2.000
Median : 1.000 Median : 1.00 Median : 2.000
Mean : 3.215 Mean : 2.83 Mean : 3.234
3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 4.000
Max. :10.000 Max. :10.00 Max. :10.000
Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses
Min. : 1.000 Min. : 1.000 Min. : 1.00 Min. : 1.000
1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.000
Median : 1.000 Median : 3.000 Median : 1.00 Median : 1.000
Mean : 3.545 Mean : 3.445 Mean : 2.87 Mean : 1.603
3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 1.000
Max. :10.000 Max. :10.000 Max. :10.00 Max. :10.000
Class
Benign :444
Malignant:239
table(data$Class)
Benign Malignant
444 239
prop.table(table(data$Class))
Benign Malignant
0.6500732 0.3499268
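For a quick visual check of this class balance, a simple bar plot does the job. This is an illustrative addition rather than part of the original write-up:
#Quick look at the class distribution (illustrative)
barplot(table(data$Class), col = c("steelblue", "tomato"), main = "Class Distribution")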
Now that our data is clean - with no missing values or obvious outliers - and we’ve confirmed that the target variable is categorical (Benign vs. Malignant), we’re ready to move into the modeling phase. But before we dive in, there’s one small but important detail: setting a random seed. This ensures that our results are reproducible every time we run the code.
Next, we split the dataset into two parts: 70% for training the models and 30% for testing how well they perform on unseen data. This split helps us evaluate the model’s ability to generalize to new cases, which is critical in a real-world medical setting.
set.seed(123)
trainingIndex <- createDataPartition(data$Class, p = 0.7, list = FALSE)
trainData <- data[trainingIndex, -1] #this will exclude the ID variable
testData <- data[-trainingIndex, -1]
With our training and testing sets in place, the next step is to start building our models. But before jumping into that, it’s important to introduce one more safeguard: cross-validation.
Cross-validation allows us to assess how well our models are likely to perform on new, unseen data by testing them across multiple folds of the training set. This helps reduce the risk of overfitting and gives us a more reliable estimate of performance before we even touch the test set.
set.seed(123)
#Define a task
task <- makeClassifTask(data = trainData, target = "Class")
#Define learners
logistic <- makeLearner("classif.logreg", predict.type = "prob")
knn <- makeLearner("classif.kknn", predict.type = "prob")
tree <- makeLearner("classif.rpart", predict.type = "prob")
svm <- makeLearner("classif.svm", predict.type = "prob", kernel = "radial")
rf <- makeLearner("classif.randomForest", predict.type = "prob")
lgbm <- makeLearner("classif.gbm", predict.type = "prob")To make it easier to compare the performance of all our models side by side, we’ll store each one in a single list. This way, we can loop through them for evaluation, plotting, and performance tracking without repeating code or creating unnecessary clutter.
Now we can implement all the models. We do not need to do any hyperparameter tuning at this point; this run is solely for benchmarking. After selecting the best model, we can then tune its hyperparameters to improve performance.
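To make that later tuning step concrete, here is a minimal sketch of what it could look like with mlr’s tuneParams(), using the GBM learner as an example. The parameter names come from the underlying gbm package, but the ranges and the random-search budget below are illustrative assumptions rather than values used in this analysis.
#Sketch: random-search tuning of the GBM learner (illustrative ranges)
gbm_params <- makeParamSet(
  makeIntegerParam("n.trees", lower = 100, upper = 1000),
  makeIntegerParam("interaction.depth", lower = 1, upper = 5),
  makeNumericParam("shrinkage", lower = 0.01, upper = 0.3)
)
tune_ctrl <- makeTuneControlRandom(maxit = 20L)
tuned <- tuneParams(
  learner = makeLearner("classif.gbm", predict.type = "prob"),
  task = task,
  resampling = makeResampleDesc("CV", iters = 5, stratify = TRUE),
  measures = auc,
  par.set = gbm_params,
  control = tune_ctrl
)
tuned$x #best hyperparameter combination found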
The Models We Explored
As mentioned earlier, we didn’t want to rely on a single algorithm. Instead, we tried a diverse set of models:
| Model | Description |
|---|---|
| Logistic Regression | A classic baseline model, simple and interpretable |
| K-Nearest Neighbors | A non-parametric model based on distance |
| Decision Tree | A rule-based model that splits on feature thresholds |
| Support Vector Machine | A margin-based classifier using RBF kernel |
| Random Forest | An ensemble of decision trees, robust and powerful |
| Gradient Boosting (GBM) | A boosting ensemble of trees, fit here via classif.gbm |
Each of these brings strengths and weaknesses to the table. So how do they compare?
#list all the learners
learners <- list(logistic, knn, tree, svm, rf, lgbm)
names(learners) <- c("Logistic", "KNN", "Tree", "SVM", "RF", "GBM")
#Define cross validation
cv <- makeResampleDesc("CV", iters = 5, stratify = TRUE)
#Performance measures
measure <- auc
#benchmark all models
bmr <- benchmark(learners, task, cv, measure)
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
Distribution not specified, assuming bernoulli ...
# Use AUC as the performance measure
measure <- auc
plotBMRBoxplots(bmr)
Key Metrics
# Apply best model to test data
test_task <- makeClassifTask(data = testData, target = "Class")
lgbm_model <- train(lgbm, task)
Distribution not specified, assuming bernoulli ...
lgbm_pred <- predict(lgbm_model, test_task)
performance(lgbm_pred, measure)
      auc 
0.9922694
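AUC alone doesn’t tell the whole story. As a small illustrative addition (not part of the original post), the same prediction object can also yield a confusion matrix and threshold-based metrics:
#Confusion matrix and extra test-set metrics (illustrative)
calculateConfusionMatrix(lgbm_pred)
performance(lgbm_pred, measures = list(acc, tpr, tnr)) #accuracy plus true/false positive rates relative to the task's positive class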
plotBMRSummary(bmr, pretty.names = TRUE)
# Convert benchmark result to threshold-vs-performance data
perf_data <- generateThreshVsPerfData(bmr, measures = list(fpr, tpr))
# Now plot ROC curves
plotROCCurves(
perf_data,
diagonal = TRUE,
pretty.names = TRUE,
facet.learner = FALSE
)
The ROC curve above compares the performance of all six classification models by plotting their true positive rate (sensitivity) against their false positive rate. Each colored line represents a different learner: logistic regression, k-nearest neighbors (KNN), decision tree (rpart), support vector machine (SVM), random forest, and gradient boosting (GBM). We can see that most models perform exceptionally well, with curves hugging the top-left corner — which indicates a high true positive rate and a low false positive rate. This is exactly what we want in a classification task. Notably, the gradient boosting model (classif.gbm) and random forest (classif.randomForest) show nearly perfect separation between the classes, closely followed by logistic regression (classif.logreg). The decision tree (classif.rpart), while still performing well, shows a slightly lower curve, suggesting it’s more prone to false positives or false negatives compared to the ensemble models.
getBMRAggrPerformances(bmr, as.df = TRUE)
    task.id           learner.id auc.test.mean
1 trainData classif.logreg 0.9941754
2 trainData classif.kknn 0.9868864
3 trainData classif.rpart 0.9537249
4 trainData classif.svm 0.9906273
5 trainData classif.randomForest 0.9922474
6 trainData classif.gbm 0.9907797
pr_data <- generateThreshVsPerfData(bmr, measures = list(tpr, ppv))
plotROCCurves(pr_data, pretty.names = TRUE, facet.learner = FALSE)
The Precision-Recall (PR) curve above provides another lens through which to evaluate our models, particularly in the context of class imbalance. Unlike the ROC curve, which focuses on true vs. false positive rates, this plot emphasizes the trade-off between precision (positive predictive value) and recall (true positive rate). In medical diagnostics, this is crucial — we want to correctly identify as many malignant cases as possible, while minimizing false alarms.
From the plot, we see that most models, including GBM (classif.gbm), random forest, and logistic regression, maintain high precision across a wide range of recall values — indicating strong performance. The SVM model (classif.svm) notably hugs the top of the plot, suggesting exceptionally high precision, especially at higher recall levels. Meanwhile, KNN (classif.kknn) shows more gradual growth, indicating it’s less precise across the board, particularly in the mid-range of recall.
Now let’s focus only on the decision tree to understand how the malignant and benign cases were predicted.
tree_model <- train(learners$Tree, task)
rpart.plot(tree_model$learner.model,
type = 3, # show variable names and splits
extra = 101, # show probabilities and percentages
fallen.leaves = FALSE,
main = "Decision Tree - rpart")Explanation
The decision tree above helps us classify tumors as either benign or malignant based on a sequence of simple yes/no questions about cell characteristics. It begins by asking whether the Uniformity of Cell Size score is less than 3.5. If the answer is no (i.e., the score is 3.5 or higher, indicating greater variation in cell size), the model confidently predicts the tumor as malignant, since the majority of such cases fall into that category. If the answer is yes (the score is below 3.5), the model digs deeper—asking additional questions about Bare Nuclei, Uniformity of Cell Shape, and Clump Thickness to refine the prediction.
For example, when Bare Nuclei is less than 2.5, the model predicts benign with high accuracy. But if it’s higher, it looks at other features to decide. Each path through the tree ends in a leaf node that shows the predicted class, the number of benign vs. malignant cases, and what percentage of the dataset falls into that category. This structure makes the model highly interpretable and provides insights into which features are most influential in determining cancer risk. It’s a great example of how even a simple model can uncover meaningful patterns in medical data.
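If we wanted to push this interpretation a bit further, the fitted rpart object also stores variable importance scores. The quick check below is an illustrative addition rather than part of the original analysis:
#Variable importance from the fitted rpart model (illustrative)
sort(tree_model$learner.model$variable.importance, decreasing = TRUE)
Features with larger values contributed more to the tree’s splits, which should line up with what the plot above already suggests.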