Introduction

Breast cancer is one of the most common cancers and a major global health concern affecting millions of women worldwide. It occurs when abnormal cells in the breast grow uncontrollably, forming malignant tumors that may spread to other parts of the body if not detected early.

Early diagnosis plays a critical role in improving patient survival and treatment outcomes. Detecting breast cancer at an early stage increases the chances of successful treatment, slows disease progression, and lowers mortality rates. Modern diagnostic approaches such as histopathological examination and medical imaging produce many measurable tumor characteristics, including radius, texture, perimeter, area, concavity, and symmetry. These quantitative features can be analyzed computationally to distinguish benign tumors from malignant tumors with high accuracy.

With the advancement of data science and machine learning, biomedical datasets containing tumor measurements can now be explored to identify important predictive patterns and support clinical decision-making.

Aim of the Project

This project aims to explore tumor morphological features and develop a machine learning model capable of classifying breast tumors as benign or malignant.

Objectives

Import and clean the dataset
Perform exploratory data analysis
Identify important tumor characteristics
Normalize the data
Train a K-Nearest Neighbors (KNN) classifier
Evaluate classification performance

Dataset description

The dataset used in this project is the Wisconsin Breast Cancer Diagnostic Dataset obtained from Kaggle and downloaded into local storage for analysis in R. The dataset contains 569 observations (rows) and 32 variables (columns), representing quantitative measurements computed from digitized images of fine needle aspirate (FNA) of breast masses.

The target variable in the dataset is the diagnosis column, which classifies tumors into two categories: Benign (B) and Malignant (M). This variable serves as the response class for machine learning classification and predictive modeling.

The remaining variables consist mainly of numerical morphological features describing characteristics of cell nuclei present in breast tissue images. These features include measurements such as radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension. Each characteristic is further represented using three feature groups: mean values, standard error (SE), and worst-case values, providing detailed quantitative information about tumor structure and behavior.

The richness and high dimensionality of the dataset make it highly suitable for exploratory data analysis, feature relationship assessment, and supervised machine learning applications for breast cancer diagnosis prediction.

Data Cleaning and Preprocessing

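This section covers importing the data, removing unneeded columns, standardizing column names, checking for missing values, and transforming the data for modeling.

Data Importation

The data was imported using the rio package and inspected using the tidyverse, as shown in the chunk below.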

# Required libraries
library(rio)        # import()
library(tidyverse)  # glimpse(), select()

# Load the WDBC dataset (CSV downloaded from Kaggle)
wdbc_data <- import("C:/Users/user/Downloads/data.csv")

# Inspect the structure of the dataset
glimpse(wdbc_data)

## Rows: 569
## Columns: 33
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ `concave points_mean`   <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ `concave points_se`     <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ `concave points_worst`  <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
## $ V33                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Inspection of the dataset shows 33 columns rather than the expected 32. The last column, V33, is unexpected, and the id column is not needed because it is a non-predictive identifier.

Removal of Unnecessary Variables

# Drop the identifier column and the unexpected V33 column
wdbc_data <- wdbc_data |>
  select(-c(id, V33))

# Inspect the data again
glimpse(wdbc_data)

## Rows: 569
## Columns: 31
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ `concave points_mean`   <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ `concave points_se`     <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ `concave points_worst`  <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…

The unnecessary id and V33 columns have now been removed. V33 contained only missing values, and id is a non-predictive identifier.
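As a small illustration of the check behind this step, columns that contain only missing values can be detected programmatically instead of by eye. The sketch below uses a three-row toy data frame (the object `toy` and its values are made up for illustration, apart from the column names) and base R only.

```r
# Toy data frame mimicking the structure of the imported file
toy <- data.frame(
  id = c(1L, 2L, 3L),
  radius_mean = c(17.99, 20.57, 19.69),
  V33 = c(NA, NA, NA)
)

# Columns in which no value is observed
all_na_cols <- names(toy)[colSums(!is.na(toy)) == 0]
all_na_cols

# Drop the identifier and every all-NA column
toy_clean <- toy[, !(names(toy) %in% c("id", all_na_cols)), drop = FALSE]
names(toy_clean)
```

On the toy data this flags only `V33`, leaving `radius_mean` as the single remaining column.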

Column Name Standardization

Some column names, such as concave points_mean, contain spaces and are inconsistent with the rest. The column names were therefore standardized using the janitor package.

# Required library
library(janitor)

# Convert all column names to consistent snake_case
wdbc_data <- wdbc_data |>
  clean_names()

# Inspect the data again
glimpse(wdbc_data)

## Rows: 569
## Columns: 31
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave_points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave_points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave_points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…

Missing Value Assessment

library(skimr)

skim(wdbc_data)

Data summary
Name	wdbc_data
Number of rows	569
Number of columns	31
_______________________
Column type frequency:
character	1
numeric	30
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
diagnosis	0	1	1	1	0	2	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
radius_mean	1	14.13	3.52	6.98	11.70	13.37	15.78	28.11	▂▇▃▁▁
texture_mean	1	19.29	4.30	9.71	16.17	18.84	21.80	39.28	▃▇▃▁▁
perimeter_mean	1	91.97	24.30	43.79	75.17	86.24	104.10	188.50	▃▇▃▁▁
area_mean	1	654.89	351.91	143.50	420.30	551.10	782.70	2501.00	▇▃▂▁▁
smoothness_mean	1	0.10	0.01	0.05	0.09	0.10	0.11	0.16	▁▇▇▁▁
compactness_mean	1	0.10	0.05	0.02	0.06	0.09	0.13	0.35	▇▇▂▁▁
concavity_mean	1	0.09	0.08	0.00	0.03	0.06	0.13	0.43	▇▃▂▁▁
concave_points_mean	1	0.05	0.04	0.00	0.02	0.03	0.07	0.20	▇▃▂▁▁
symmetry_mean	1	0.18	0.03	0.11	0.16	0.18	0.20	0.30	▁▇▅▁▁
fractal_dimension_mean	1	0.06	0.01	0.05	0.06	0.06	0.07	0.10	▆▇▂▁▁
radius_se	1	0.41	0.28	0.11	0.23	0.32	0.48	2.87	▇▁▁▁▁
texture_se	1	1.22	0.55	0.36	0.83	1.11	1.47	4.88	▇▅▁▁▁
perimeter_se	1	2.87	2.02	0.76	1.61	2.29	3.36	21.98	▇▁▁▁▁
area_se	1	40.34	45.49	6.80	17.85	24.53	45.19	542.20	▇▁▁▁▁
smoothness_se	1	0.01	0.00	0.00	0.01	0.01	0.01	0.03	▇▃▁▁▁
compactness_se	1	0.03	0.02	0.00	0.01	0.02	0.03	0.14	▇▃▁▁▁
concavity_se	1	0.03	0.03	0.00	0.02	0.03	0.04	0.40	▇▁▁▁▁
concave_points_se	1	0.01	0.01	0.00	0.01	0.01	0.01	0.05	▇▇▁▁▁
symmetry_se	1	0.02	0.01	0.01	0.02	0.02	0.02	0.08	▇▃▁▁▁
fractal_dimension_se	1	0.00	0.00	0.00	0.00	0.00	0.00	0.03	▇▁▁▁▁
radius_worst	1	16.27	4.83	7.93	13.01	14.97	18.79	36.04	▆▇▃▁▁
texture_worst	1	25.68	6.15	12.02	21.08	25.41	29.72	49.54	▃▇▆▁▁
perimeter_worst	1	107.26	33.60	50.41	84.11	97.66	125.40	251.20	▇▇▃▁▁
area_worst	1	880.58	569.36	185.20	515.30	686.50	1084.00	4254.00	▇▂▁▁▁
smoothness_worst	1	0.13	0.02	0.07	0.12	0.13	0.15	0.22	▂▇▇▂▁
compactness_worst	1	0.25	0.16	0.03	0.15	0.21	0.34	1.06	▇▅▁▁▁
concavity_worst	1	0.27	0.21	0.00	0.11	0.23	0.38	1.25	▇▅▂▁▁
concave_points_worst	1	0.11	0.07	0.00	0.06	0.10	0.16	0.29	▅▇▅▃▁
symmetry_worst	1	0.29	0.06	0.16	0.25	0.28	0.32	0.66	▅▇▁▁▁
fractal_dimension_worst	1	0.08	0.02	0.06	0.07	0.08	0.09	0.21	▇▃▁▁▁

No missing values were detected in the dataset.

sum(is.na(wdbc_data))

## [1] 0

Data transformation

The diagnosis variable was converted to a categorical factor so that it can be used directly as the dependent variable when training the machine learning model.

# Convert the diagnosis variable to a factor with readable labels
wdbc_data$diagnosis <- factor(wdbc_data$diagnosis, levels = c("B", "M"),
                              labels = c("Benign", "Malignant"))

# Check the percentage of observations in each class
round(prop.table(table(wdbc_data$diagnosis)) * 100, digits = 1)

## 
##    Benign Malignant 
##      62.7      37.3

About 62.7% of the observations are benign and 37.3% are malignant.

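# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

A min-max scaling function was defined to normalize the features. It is applied to the dataset in the Feature scaling section.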
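To make the behavior of min-max scaling concrete, the sketch below applies it to a three-value vector. It also shows one possible guard for a constant column, where the range is zero and the plain formula would divide by zero. The name `normalize_safe` is a hypothetical variant introduced here for illustration; it is not part of the analysis in this report.

```r
# Hypothetical variant of normalize() that handles constant vectors
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant vector has zero range; return zeros instead of NaN
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled_example <- normalize_safe(c(10, 20, 30))   # 0.0 0.5 1.0
constant_example <- normalize_safe(c(5, 5, 5))    # 0 0 0

scaled_example
constant_example
```

None of the 30 features in this dataset is constant, so the guard would not change any result here. It only matters if the function is reused on other data.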

Exploratory data analysis

# visualization packages
library(ggplot2)
library(GGally)

library(corrplot)
library(reshape2)
library(patchwork)

Figure 1: Class distribution of breast cancer diagnosis
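A class-distribution chart of this kind could be produced roughly as follows. This is a sketch and not necessarily the code used for the figure: it rebuilds the class counts (357 benign, 212 malignant) from the summary in Appendix 2 instead of reading the dataset, and the object names `class_counts` and `class_plot` are illustrative.

```r
library(ggplot2)

# Class counts as reported in the summary of the clean data (Appendix 2)
class_counts <- data.frame(
  diagnosis = factor(c("Benign", "Malignant")),
  n = c(357, 212)
)
class_counts$percent <- round(100 * class_counts$n / sum(class_counts$n), 1)

# Bar chart of class counts with percentage labels above each bar
class_plot <- ggplot(class_counts, aes(x = diagnosis, y = n, fill = diagnosis)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.5) +
  labs(x = "Diagnosis", y = "Number of observations") +
  theme_minimal()

class_plot
```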

Interpretation:

The correlation matrix reveals strong positive relationships among size-related features such as radius, perimeter, and area. This indicates that these variables are highly interdependent and likely capture similar information about tumor size. High multicollinearity is observed across several feature groups, suggesting redundancy in the dataset. For a distance-based model like KNN, this redundancy means that correlated features jointly carry more weight in the distance calculation, which should be kept in mind when interpreting the results.

Figure 2: Correlation structure of the tumor features

# correlation matrix
cor_matrix <- cor(wdbc_data |> select(-diagnosis))
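One way to make the redundancy in a correlation matrix explicit is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below does this in base R on simulated data, not on the WDBC measurements: perimeter and area are generated from a random radius, while texture is independent. The helper `high_cor_pairs` and the 0.9 cutoff are assumptions chosen for illustration.

```r
set.seed(42)  # reproducible simulated data, not the WDBC measurements

# Simulated stand-in: perimeter and area derived from radius, texture independent
radius <- runif(200, min = 7, max = 28)
toy <- data.frame(
  radius    = radius,
  perimeter = 2 * pi * radius + rnorm(200, sd = 2),
  area      = pi * radius^2 + rnorm(200, sd = 30),
  texture   = rnorm(200, mean = 19, sd = 4)
)

# Hypothetical helper: list feature pairs whose |correlation| exceeds a cutoff
high_cor_pairs <- function(data, cutoff = 0.9) {
  cm <- cor(data)
  # Keep each pair once by looking only at the upper triangle
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    feature_1   = rownames(cm)[idx[, "row"]],
    feature_2   = colnames(cm)[idx[, "col"]],
    correlation = round(cm[idx], 3)
  )
}

pairs_found <- high_cor_pairs(toy)
pairs_found
```

The same helper could be called on the numeric columns of the real dataset, for example `high_cor_pairs(wdbc_data[, -1])`, assuming diagnosis is the first column as in this report.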

Interpretation:

The pairwise scatterplot matrix provides a comprehensive view of relationships between selected tumor features. A clear separation is observed between benign and malignant cases across multiple feature distributions, particularly for radius, perimeter, area, concavity, and compactness. Malignant tumors tend to cluster toward higher values, indicating larger and more irregular cell structures. The diagonal density plots further highlight distinct distribution shifts between the two classes, reinforcing the predictive strength of these features.

Figure 3: Pairwise relationships between key tumor features

# Select the key features used in the pairwise plot
selected_features <- wdbc_data |>
  select(
    diagnosis,
    radius_mean,
    perimeter_mean,
    area_mean,
    concavity_mean,
    compactness_mean
  )
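A pairwise plot colored by diagnosis could then be drawn with `ggpairs()` from GGally. The sketch below is an assumption about how such a figure might be built, not the code behind Figure 3. To stay self-contained it runs on a small simulated data frame with two of the column names used above; the simulation parameters are arbitrary.

```r
library(ggplot2)
library(GGally)

set.seed(1)  # simulated stand-in data, not the WDBC measurements

toy_features <- data.frame(
  diagnosis      = factor(rep(c("Benign", "Malignant"), each = 50)),
  radius_mean    = c(rnorm(50, mean = 12, sd = 2), rnorm(50, mean = 17, sd = 3)),
  concavity_mean = c(rnorm(50, mean = 0.05, sd = 0.03), rnorm(50, mean = 0.16, sd = 0.07))
)

# Scatterplots, per-class densities on the diagonal, and correlations,
# all colored by diagnosis; columns 2:3 are the numeric features
pairs_plot <- ggpairs(
  toy_features,
  columns = 2:3,
  mapping = aes(colour = diagnosis)
)

pairs_plot
```

With the report's own objects, the equivalent call would pass `selected_features` and `columns = 2:6`.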

Interpretation:

The distribution plots show that malignant tumors consistently exhibit higher values across most morphological features compared to benign tumors. Features such as area, perimeter, and concavity demonstrate strong class separation. This suggests that tumor geometry and irregularity are strong indicators of malignancy and can be effectively used for classification tasks.

Machine Learning model development

Feature scaling

K-Nearest Neighbors (KNN) is a distance-based algorithm, meaning it classifies data points by measuring how close they are to each other using a distance metric such as Euclidean distance. In datasets like this one, the features are measured on different scales (for example, area values are much larger in magnitude than texture or smoothness values). Without scaling, features with larger numerical ranges would disproportionately influence the distance calculation, leading to biased and unreliable predictions.

To address this, feature scaling (normalization) was applied to transform all variables to a common range between 0 and 1. This ensures that each feature contributes equally to the distance computation, allowing the model to make more balanced and meaningful comparisons between observations. As a result, the classification performance becomes more stable and reflective of true underlying patterns in the data rather than differences in measurement scale.
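The effect of scale on Euclidean distance can be checked on the first two observations shown in the glimpse() output, using only area_mean and smoothness_mean. The minimum and maximum values below are taken from the summary in Appendix 2. The object names are illustrative, and restricting the distance to two features is a simplification.

```r
# First two observations from the glimpse() output
raw <- rbind(
  obs1 = c(area_mean = 1001.0, smoothness_mean = 0.11840),
  obs2 = c(area_mean = 1326.0, smoothness_mean = 0.08474)
)

# Feature ranges taken from the summary in Appendix 2
mins <- c(area_mean = 143.5, smoothness_mean = 0.05263)
maxs <- c(area_mean = 2501.0, smoothness_mean = 0.16340)

# Min-max scale each column: (x - min) / (max - min)
scaled <- sweep(sweep(raw, 2, mins, "-"), 2, maxs - mins, "/")

d_raw <- as.numeric(dist(raw))        # Euclidean distance on raw values
d_scaled <- as.numeric(dist(scaled))  # Euclidean distance after scaling

# Share of the squared distance contributed by area_mean
share_area_raw <- (raw[1, 1] - raw[2, 1])^2 / d_raw^2
share_area_scaled <- (scaled[1, 1] - scaled[2, 1])^2 / d_scaled^2

c(d_raw = d_raw, d_scaled = d_scaled)
c(share_area_raw = share_area_raw, share_area_scaled = share_area_scaled)
```

On the raw values the distance is about 325 and comes almost entirely from area_mean. After scaling, the distance is about 0.33 and smoothness_mean contributes the larger share.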

# apply normalisation
wdbc_n <- as.data.frame(lapply(wdbc_data[2 : 31], normalize))

summary(wdbc_n[2:5])

##   texture_mean    perimeter_mean     area_mean      smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2185   1st Qu.:0.2168   1st Qu.:0.1174   1st Qu.:0.3046  
##  Median :0.3088   Median :0.2933   Median :0.1729   Median :0.3904  
##  Mean   :0.3240   Mean   :0.3329   Mean   :0.2169   Mean   :0.3948  
##  3rd Qu.:0.4089   3rd Qu.:0.4168   3rd Qu.:0.2711   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Train-Test split

To evaluate the performance and generalization ability of the machine learning model, the dataset was divided into two distinct subsets: a training set and a test set. The training set consisted of 469 observations, while the test set contained 100 observations.

The training set is used to fit the K-Nearest Neighbors (KNN) model, allowing it to learn patterns and relationships between tumor features and the corresponding diagnosis labels. In contrast, the test set is kept completely unseen during model training and is used solely for performance evaluation.

This separation is critical in machine learning because it provides an unbiased assessment of how well the model can generalize to new, unseen data. By evaluating the model on the test set, we can better estimate its real-world predictive performance and reduce the risk of overfitting, where a model performs well on training data but poorly on new observations.

# Split the normalized features into training and test sets
wdbc_train <- wdbc_n[1:469, ]
wdbc_test <- wdbc_n[470:569, ]

# Split the diagnosis labels the same way
wdbc_train_label <- wdbc_data$diagnosis[1:469]
wdbc_test_label <- wdbc_data$diagnosis[470:569]
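The split above takes rows in file order. A randomized alternative with the same set sizes could be sketched as follows. This is not the split used for the results in this report: the seed value is arbitrary, and applying these indices would change the reported confusion matrix.

```r
set.seed(123)  # arbitrary seed so that the split is reproducible

n_obs <- 569   # observations in the dataset
n_test <- 100  # same test-set size as the sequential split

# Draw the test rows at random; all remaining rows form the training set
test_idx <- sample(seq_len(n_obs), size = n_test)
train_idx <- setdiff(seq_len(n_obs), test_idx)

c(train = length(train_idx), test = length(test_idx))

# With the objects from this report, the indices could be applied as:
# wdbc_train       <- wdbc_n[train_idx, ]
# wdbc_test        <- wdbc_n[test_idx, ]
# wdbc_train_label <- wdbc_data$diagnosis[train_idx]
# wdbc_test_label  <- wdbc_data$diagnosis[test_idx]
```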

K-Nearest Neighbor Algorithm

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method used for classification tasks. It is a non-parametric, instance-based learning approach, meaning it does not assume any underlying distribution of the data and does not build an explicit model during training. Instead, it stores the training data and performs classification based on similarity between observations.

In this project, KNN classifies each test sample by identifying the k closest observations from the training dataset and assigning the class that appears most frequently among those neighbors (majority voting principle). The similarity between observations is measured using Euclidean distance, which calculates the straight-line distance between data points in multidimensional feature space.

A value of k = 21 was selected for the model. This parameter controls the number of neighbors considered during classification and plays a crucial role in balancing model bias and variance. A relatively larger k-value helps to reduce noise sensitivity and produces smoother decision boundaries, which is particularly useful in biomedical datasets where feature variability can be high.
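The distance-and-vote procedure described above can be sketched in a few lines of base R. `knn_predict_one` is a hypothetical helper written for illustration, not the implementation inside the class package, and the six training points are made up.

```r
# Hypothetical helper: classify one query point by majority vote
# among its k nearest training points (Euclidean distance)
knn_predict_one <- function(train, labels, query, k) {
  diffs <- sweep(as.matrix(train), 2, query)  # subtract the query from each row
  dists <- sqrt(rowSums(diffs^2))             # Euclidean distance to each row
  nearest <- order(dists)[seq_len(k)]         # indices of the k closest rows
  votes <- table(labels[nearest])             # count the class labels
  names(votes)[which.max(votes)]              # most frequent class
}

# Made-up training data on a 0-1 scale
train <- data.frame(
  x = c(0.10, 0.20, 0.15, 0.80, 0.90, 0.85),
  y = c(0.10, 0.20, 0.10, 0.80, 0.90, 0.90)
)
labels <- factor(c("Benign", "Benign", "Benign",
                   "Malignant", "Malignant", "Malignant"))

pred_high <- knn_predict_one(train, labels, query = c(0.75, 0.80), k = 3)
pred_low <- knn_predict_one(train, labels, query = c(0.20, 0.10), k = 3)

c(pred_high, pred_low)
```

The first query lies among the high-valued points and is assigned "Malignant". The second lies among the low-valued points and is assigned "Benign".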

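The general form of the knn() function from the class package is:

p <- knn(train = training_data, test = test_data, cl = training_labels, k = 21)

Here train and test hold the normalized features, cl holds the class labels of the training observations, and k is the number of neighbors.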

# require library 
library(class)

wdbc_prediction <- knn(train = wdbc_train, test = wdbc_test, cl = wdbc_train_label, k = 21)
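The choice of k can be checked empirically by comparing several values. The sketch below uses `knn.cv()` from the class package, which performs leave-one-out cross-validation, on a simulated two-class dataset. The data, the candidate k values, and the seed are assumptions for illustration, so the resulting accuracies say nothing about the WDBC data. A common rule of thumb starts from k near the square root of the number of training observations, which for 469 gives about 21; whether that motivated the choice in this report is an assumption.

```r
library(class)

set.seed(2024)  # simulated data, not the WDBC measurements

n_per_class <- 100
x <- data.frame(
  f1 = c(rnorm(n_per_class, 0.30, 0.10), rnorm(n_per_class, 0.60, 0.15)),
  f2 = c(rnorm(n_per_class, 0.30, 0.10), rnorm(n_per_class, 0.55, 0.15))
)
y <- factor(rep(c("Benign", "Malignant"), each = n_per_class))

# Leave-one-out accuracy for each candidate k
k_values <- c(1, 5, 11, 21, 31)
loocv_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred == y)
})
names(loocv_accuracy) <- k_values
loocv_accuracy

# Rule-of-thumb starting point for 469 training observations
k_rule_of_thumb <- floor(sqrt(469))
k_rule_of_thumb
```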

Model Evaluation

Model performance was evaluated using a confusion matrix, which provides a detailed breakdown of correct and incorrect predictions made by the K-Nearest Neighbors (KNN) classifier. This evaluation approach is particularly important in medical classification problems, where the consequences of different types of errors are not equal.

# require library 
library(gmodels)

CrossTable(x = wdbc_test_label, y = wdbc_prediction, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                 | wdbc_prediction 
## wdbc_test_label |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |        77 |         0 |        77 | 
##                 |     1.000 |     0.000 |     0.770 | 
##                 |     0.975 |     0.000 |           | 
##                 |     0.770 |     0.000 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         2 |        21 |        23 | 
##                 |     0.087 |     0.913 |     0.230 | 
##                 |     0.025 |     1.000 |           | 
##                 |     0.020 |     0.210 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |        79 |        21 |       100 | 
##                 |     0.790 |     0.210 |           | 
## ----------------|-----------|-----------|-----------|
## 
##

In this context, the confusion matrix is interpreted as follows:

True Positive (TP): Malignant tumors correctly classified as malignant
True Negative (TN): Benign tumors correctly classified as benign
False Positive (FP): Benign tumors incorrectly classified as malignant
False Negative (FN): Malignant tumors incorrectly classified as benign

Among these outcomes, false negatives are especially critical in a clinical setting because they represent cases where a malignant tumor is mistakenly labeled as benign, potentially delaying diagnosis and treatment.

The confusion matrix shows that 98 of the 100 test observations were classified correctly. All 77 benign tumors were correctly identified, and 21 of the 23 malignant tumors were detected. The 2 misclassifications were malignant tumors predicted as benign (false negatives). This suggests that the tumor features, combined with normalization and the KNN algorithm, are effective for breast cancer classification on this test set.

Classification performance

Based on the confusion matrix obtained in this study:

1. True Positive (TP) = 21
2. True Negative (TN) = 77
3. False Positive (FP) = 0
4. False Negative (FN) = 2

Accuracy

Formula = (TP + TN) / (TP + TN + FN + FP)

Accuracy = (21 + 77) / (21 + 77 + 2 + 0) = 98%

The model achieved an accuracy of 98%, meaning that the classifier correctly predicted 98 of the 100 test cases.

Specificity

Formula = (TN) / (TN + FP)

Specificity = 77 / (77 + 0) = 100%

The model achieved a specificity of 100%, meaning that all benign tumors in the test dataset were correctly classified without any false positive predictions.

Sensitivity

Formula = (TP) / (TP + FN)

Sensitivity = 21 / (21 + 2) = 91.3%

The sensitivity of the model was approximately 91.3%, meaning that 21 of the 23 malignant tumors in the test dataset were correctly detected by the classifier.
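The three metrics can be recomputed directly from the four confusion-matrix counts reported above. This is a minimal sketch in base R, with variable names chosen for illustration.

```r
# Counts from the confusion matrix (Malignant is the positive class)
tp <- 21
tn <- 77
fp <- 0
fn <- 2

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
specificity <- tn / (tn + fp)
sensitivity <- tp / (tp + fn)

# Express the metrics as percentages rounded to one decimal place
metrics_percent <- round(
  100 * c(accuracy = accuracy, specificity = specificity, sensitivity = sensitivity),
  1
)
metrics_percent
```

This gives 98.0 for accuracy, 100.0 for specificity, and 91.3 for sensitivity, matching the values stated in this section.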

Overall, the evaluation metrics demonstrate that the KNN classifier exhibited excellent predictive capability for breast cancer diagnosis. The high accuracy, strong sensitivity, and perfect specificity suggest that the selected tumor features provide substantial discriminatory power for distinguishing between benign and malignant breast tumors.

Discussion

This study demonstrated the effectiveness of machine learning techniques in the classification of breast tumors using morphological characteristics extracted from digitized breast tissue images. The exploratory data analysis revealed strong structural relationships among several tumor features, while the K-Nearest Neighbors (KNN) model achieved high classification performance after appropriate preprocessing and normalization.

One of the major findings from the exploratory analysis was the presence of strong positive correlations among size-related variables such as radius, perimeter, and area. Biologically, this relationship is expected because larger tumors generally possess greater perimeter measurements and occupy larger surface areas. Additionally, malignant tumors consistently exhibited higher values for features associated with irregular growth patterns, including concavity and compactness. These findings suggest that tumor morphology plays a significant role in distinguishing malignant tissue from benign tissue.

The pairwise visualization analysis further demonstrated clear class separability between benign and malignant tumors. Malignant samples tended to cluster around larger and more irregular feature values, while benign tumors were more concentrated within lower ranges. This visible separation indicates that the dataset contains highly informative features capable of supporting accurate predictive modeling.

The KNN algorithm performed particularly well in this study due to the structure of the dataset and the nature of the extracted features. Since KNN is a distance-based algorithm, observations with similar tumor characteristics lie close together within the feature space. The relatively distinct separation between benign and malignant samples allowed the algorithm to classify unseen observations effectively using neighborhood similarity. Furthermore, normalization ensured that all variables contributed equally to distance calculations, preventing features with larger numerical scales from dominating the classification process. Its effect on accuracy was not measured separately in this project.

Despite the strong performance observed, several limitations should be acknowledged. First, the dataset is relatively small compared to large-scale clinical datasets commonly used in modern biomedical machine learning research. A larger and more diverse dataset could improve model generalizability and robustness. Second, the train-test split used in this project was sequential rather than randomized, which may introduce sampling bias and affect evaluation reliability. Randomized sampling or cross-validation techniques would provide a more reliable estimate of model performance. Finally, only a single machine learning algorithm (KNN) was explored. Although KNN achieved excellent results, comparing multiple classification models such as Random Forest, Support Vector Machine (SVM), Logistic Regression, or XGBoost could provide deeper insights into optimal predictive performance for this dataset.

Overall, the findings from this project highlight the strong potential of machine learning approaches in supporting breast cancer diagnosis. The combination of data cleaning, exploratory data analysis, normalization, and predictive modeling demonstrates how computational techniques can assist in identifying clinically relevant patterns within biomedical datasets.

Conclusion

In conclusion, this project demonstrated that morphological tumor features can effectively distinguish benign from malignant breast tumors using machine learning techniques. Exploratory analysis revealed strong inter-feature relationships and clear class separability, while the KNN classifier achieved high predictive accuracy after feature normalization.

Appendix 1

The clean dataset

# Open the first 100 rows of the clean dataset in the data viewer
View(wdbc_data[1:100, ])

Appendix 2

Summary of the clean data

summary(wdbc_data)

##      diagnosis    radius_mean      texture_mean   perimeter_mean  
##  Benign   :357   Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  Malignant:212   1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##                  Median :13.370   Median :18.84   Median : 86.24  
##                  Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##                  3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##                  Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  concave_points_mean symmetry_mean    fractal_dimension_mean   radius_se     
##  Min.   :0.00000     Min.   :0.1060   Min.   :0.04996        Min.   :0.1115  
##  1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770        1st Qu.:0.2324  
##  Median :0.03350     Median :0.1792   Median :0.06154        Median :0.3242  
##  Mean   :0.04892     Mean   :0.1812   Mean   :0.06280        Mean   :0.4052  
##  3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612        3rd Qu.:0.4789  
##  Max.   :0.20120     Max.   :0.3040   Max.   :0.09744        Max.   :2.8730  
##    texture_se      perimeter_se       area_se        smoothness_se     
##  Min.   :0.3602   Min.   : 0.757   Min.   :  6.802   Min.   :0.001713  
##  1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850   1st Qu.:0.005169  
##  Median :1.1080   Median : 2.287   Median : 24.530   Median :0.006380  
##  Mean   :1.2169   Mean   : 2.866   Mean   : 40.337   Mean   :0.007041  
##  3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190   3rd Qu.:0.008146  
##  Max.   :4.8850   Max.   :21.980   Max.   :542.200   Max.   :0.031130  
##  compactness_se      concavity_se     concave_points_se   symmetry_se      
##  Min.   :0.002252   Min.   :0.00000   Min.   :0.000000   Min.   :0.007882  
##  1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638   1st Qu.:0.015160  
##  Median :0.020450   Median :0.02589   Median :0.010930   Median :0.018730  
##  Mean   :0.025478   Mean   :0.03189   Mean   :0.011796   Mean   :0.020542  
##  3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710   3rd Qu.:0.023480  
##  Max.   :0.135400   Max.   :0.39600   Max.   :0.052790   Max.   :0.078950  
##  fractal_dimension_se  radius_worst   texture_worst   perimeter_worst 
##  Min.   :0.0008948    Min.   : 7.93   Min.   :12.02   Min.   : 50.41  
##  1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11  
##  Median :0.0031870    Median :14.97   Median :25.41   Median : 97.66  
##  Mean   :0.0037949    Mean   :16.27   Mean   :25.68   Mean   :107.26  
##  3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40  
##  Max.   :0.0298400    Max.   :36.04   Max.   :49.54   Max.   :251.20  
##    area_worst     smoothness_worst  compactness_worst concavity_worst 
##  Min.   : 185.2   Min.   :0.07117   Min.   :0.02729   Min.   :0.0000  
##  1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145  
##  Median : 686.5   Median :0.13130   Median :0.21190   Median :0.2267  
##  Mean   : 880.6   Mean   :0.13237   Mean   :0.25427   Mean   :0.2722  
##  3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829  
##  Max.   :4254.0   Max.   :0.22260   Max.   :1.05800   Max.   :1.2520  
##  concave_points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.29100      Max.   :0.6638   Max.   :0.20750

Dedication

I dedicate this project to my beloved mother, whose endless love, prayers, sacrifices, and unwavering support have been a source of strength throughout my academic journey. Your care and encouragement continue to inspire me to strive for excellence in all that I do.

To my father, who has remained a strong pillar in my life, thank you for your guidance, support, discipline, and constant belief in my potential. Your efforts and sacrifices have laid the foundation for my growth and success.

I also dedicate this work to my darling girlfriend, Zahra Ayunie💕❤️, an amazing and supportive soul who has stood by me through thick and thin. Thank you for your love, encouragement, advice, and for sharing dreams and aspirations with me. Your presence has brought motivation, comfort, and happiness into my journey. I pray that Allah grants us success and makes our dreams come to pass, Insha’Allah.

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Africa/Lagos
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gmodels_2.19.1  class_7.3-23    patchwork_1.3.2 reshape2_1.4.5 
##  [5] corrplot_0.95   GGally_2.4.0    skimr_2.2.2     janitor_2.2.1  
##  [9] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.0    
## [13] purrr_1.2.1     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
## [17] ggplot2_4.0.2   tidyverse_2.0.0 rio_1.2.4      
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10         generics_0.1.4      gtools_3.9.5       
##  [4] stringi_1.8.7       hms_1.1.4           digest_0.6.39      
##  [7] magrittr_2.0.4      evaluate_1.0.5      grid_4.5.2         
## [10] timechange_0.4.0    RColorBrewer_1.1-3  fastmap_1.2.0      
## [13] plyr_1.8.9          R.oo_1.27.1         jsonlite_2.0.0     
## [16] R.utils_2.13.0      scales_1.4.0        jquerylib_0.1.4    
## [19] cli_3.6.5           rlang_1.1.7         R.methodsS3_1.8.2  
## [22] base64enc_0.1-6     withr_3.0.2         repr_1.1.7         
## [25] cachem_1.1.0        yaml_2.3.12         tools_4.5.2        
## [28] tzdb_0.5.0          ggstats_0.13.0      vctrs_0.7.1        
## [31] R6_2.6.1            lifecycle_1.0.5     snakecase_0.11.1   
## [34] MASS_7.3-65         pkgconfig_2.0.3     pillar_1.11.1      
## [37] bslib_0.10.0        gtable_0.3.6        Rcpp_1.1.1         
## [40] glue_1.8.0          data.table_1.18.2.1 xfun_0.56          
## [43] tidyselect_1.2.1    rstudioapi_0.18.0   knitr_1.51         
## [46] farver_2.1.2        htmltools_0.5.9     gdata_3.0.1        
## [49] labeling_0.4.3      rmarkdown_2.30      compiler_4.5.2     
## [52] S7_0.2.1

Machine Learning-Based Breast Cancer Diagnosis Using Morphological Features

Abdul-Baasit Ismail Olatunjin, Folaran Hajarat Dasola, Fatimah Mustapha Falimata

22nd May 2025