Breast cancer is one of the most common cancers and a major global health concern affecting millions of women worldwide. It occurs when abnormal cells in the breast grow uncontrollably, forming malignant tumors that may spread to other parts of the body if not detected early.
However, early diagnosis plays a critical role in improving patient survival and treatment outcomes. Detecting breast cancer at an early stage increases the chances of successful treatment, reduces disease progression, and lowers mortality rates. Modern diagnostic approaches such as histopathological examination and medical imaging techniques generate large amounts of measurable tumor characteristics, including radius, texture, perimeter, area, concavity, and symmetry. These quantitative features can be analyzed computationally to distinguish benign tumors from malignant tumors with high accuracy.
With the advancement of data science and machine learning, biomedical datasets containing tumor measurements can now be explored to identify important predictive patterns and support clinical decision-making.
This project aims to explore tumor morphological features and develop a machine learning model capable of classifying breast tumors as benign or malignant.
The dataset used in this project is the Wisconsin Breast Cancer Diagnostic Dataset obtained from Kaggle and downloaded into local storage for analysis in R. The dataset contains 569 observations (rows) and 32 variables (columns), representing quantitative measurements computed from digitized images of fine needle aspirate (FNA) of breast masses.
The target variable in the dataset is the diagnosis column, which classifies tumors into two categories: Benign (B) and Malignant (M). This variable serves as the response class for machine learning classification and predictive modeling.
The remaining variables consist mainly of numerical morphological features describing characteristics of cell nuclei present in breast tissue images. These features include measurements such as radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension. Each characteristic is further represented using three feature groups: mean values, standard error (SE), and worst-case values, providing detailed quantitative information about tumor structure and behavior.
The richness and high dimensionality of the dataset make it highly suitable for exploratory data analysis, feature relationship assessment, and supervised machine learning applications for breast cancer diagnosis prediction.
This is where the project gets interesting.
The data is imported with the rio package and inspected with the tidyverse, as shown in the chunk below.
# require libraries
library(rio)
library(tidyverse)
# load the WDBC dataset downloaded from Kaggle
wdbc_data <- import("C:/Users/user/Downloads/data.csv")
# inspect the dataset
glimpse(wdbc_data)
## Rows: 569
## Columns: 33
## $ id <int> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
## $ V33 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Inspection of the dataset shows 33 columns instead of the expected 32. The last column, V33, is unexpected and contains only missing values (likely the result of a trailing delimiter in the CSV header), and the id column is not needed because it is a non-predictive identifier.
wdbc_data <- wdbc_data |>
  select(!c(id, V33))
# inspecting the data again
glimpse(wdbc_data)
## Rows: 569
## Columns: 31
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
The unnecessary columns V33 and id have now been removed: V33 contained only missing values, and id carries no predictive information.
Some column names, such as `concave points_mean`, contain spaces and are inconsistent with the rest. The column names are therefore standardized using the janitor package.
# require library
library(janitor)
wdbc_data <- wdbc_data |>
clean_names()
#inspecting again
glimpse(wdbc_data)
## Rows: 569
## Columns: 31
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave_points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave_points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave_points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
library(skimr)
skim(wdbc_data)
| Name | wdbc_data |
| Number of rows | 569 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| radius_mean | 0 | 1 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 | ▂▇▃▁▁ |
| texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 | ▃▇▃▁▁ |
| perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 | ▃▇▃▁▁ |
| area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 | ▇▃▂▁▁ |
| smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 | ▁▇▇▁▁ |
| compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
| concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
| concave_points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
| symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
| fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
| radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
| texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
| perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
| area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 | ▇▁▁▁▁ |
| smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
| compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
| concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
| concave_points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
| symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
| fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
| radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
| texture_worst | 0 | 1 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 | ▃▇▆▁▁ |
| perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 | ▇▇▃▁▁ |
| area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 | ▇▂▁▁▁ |
| smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
| compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
| concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
| concave_points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
| symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
| fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
No missing values were detected in the dataset.
sum(is.na(wdbc_data))
## [1] 0
The diagnosis variable was converted to a factor so that it can be used directly as the dependent variable when training the machine learning model.
# Convert the diagnosis variable to a factor
wdbc_data$diagnosis <- factor(wdbc_data$diagnosis, levels = c ("B", "M"),
labels = c("Benign", "Malignant"))
# Checking percentage of label data
round(prop.table(table(wdbc_data$diagnosis)) *100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
Of the 569 observations, 62.7% are benign and 37.3% are malignant.
# normalization
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
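The dataset is normalized using the min-max scaling technique, which rescales each feature to the range 0 to 1. The function defined above is applied to the feature columns before model training.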
# visualization packages
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.3
library(corrplot)
library(reshape2)
library(patchwork)
Interpretation:
The correlation matrix reveals strong positive relationships among size-related features such as radius, perimeter, and area. This indicates that these variables are highly interdependent and likely capture similar biological information about tumor growth. High multicollinearity is observed across several feature groups, suggesting redundancy in the dataset. For distance-based models such as KNN, highly correlated features effectively contribute more than once to the distance calculation, so this redundancy should be kept in mind when interpreting the model.
# correlation matrix
cor_matrix <- cor(wdbc_data |> select(-diagnosis))
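The geometric reason for the strong correlation among size features can be sketched with toy data. The block below is an illustration under stated assumptions, not part of the original analysis: it simulates radii for idealized circular nuclei (the names `toy`, `cor_toy`, and `high_pairs` are hypothetical), derives perimeter and area from them, and lists the feature pairs whose absolute correlation exceeds 0.9, a threshold chosen only for illustration.

```r
# Toy data (assumption): idealized circular nuclei, not the WDBC measurements
set.seed(42)
radius <- runif(200, min = 7, max = 28)
toy <- data.frame(
  radius    = radius,
  perimeter = 2 * pi * radius,               # circumference of a circle
  area      = pi * radius^2,                 # area of a circle
  texture   = rnorm(200, mean = 19, sd = 4)  # generated independently of size
)

cor_toy <- cor(toy)

# Keep each pair once (upper triangle) and flag strong correlations
idx <- which(upper.tri(cor_toy) & abs(cor_toy) > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1   = rownames(cor_toy)[idx[, "row"]],
  feature_2   = colnames(cor_toy)[idx[, "col"]],
  correlation = cor_toy[idx]
)
print(high_pairs)
```

The same filter could be applied to `cor_matrix` to list the strongly correlated pairs among the 30 real features.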
Interpretation:
The pairwise scatterplot matrix provides a comprehensive view of relationships between selected tumor features. A clear separation is observed between benign and malignant cases across multiple feature distributions, particularly for radius, perimeter, area, concavity, and compactness. Malignant tumors tend to cluster toward higher values, indicating larger and more irregular cell structures. The diagonal density plots further highlight distinct distribution shifts between the two classes, reinforcing the predictive strength of these features.
# selecting feature
selected_features <- wdbc_data |>
select(
diagnosis,
radius_mean,
perimeter_mean,
area_mean,
concavity_mean,
compactness_mean
)
Interpretation:
The distribution plots show that malignant tumors consistently exhibit higher values across most morphological features compared to benign tumors. Features such as area, perimeter, and concavity demonstrate strong class separation. This suggests that tumor geometry and irregularity are strong indicators of malignancy and can be effectively used for classification tasks.
K-Nearest Neighbors (KNN) is a distance-based algorithm, meaning it classifies data points by measuring how close they are to each other using a distance metric such as Euclidean distance. In datasets like this one, the features are measured on different scales (for example, area values are much larger in magnitude than texture or smoothness values). Without scaling, features with larger numerical ranges would disproportionately influence the distance calculation, leading to biased and unreliable predictions.
To address this, feature scaling (normalization) was applied to transform all variables to a common range between 0 and 1. This ensures that each feature contributes equally to the distance computation, allowing the model to make more balanced and meaningful comparisons between observations. As a result, the classification performance becomes more stable and reflective of true underlying patterns in the data rather than differences in measurement scale.
# apply normalisation
wdbc_n <- as.data.frame(lapply(wdbc_data[2 : 31], normalize))
summary(wdbc_n[2:5])
## texture_mean perimeter_mean area_mean smoothness_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2185 1st Qu.:0.2168 1st Qu.:0.1174 1st Qu.:0.3046
## Median :0.3088 Median :0.2933 Median :0.1729 Median :0.3904
## Mean :0.3240 Mean :0.3329 Mean :0.2169 Mean :0.3948
## 3rd Qu.:0.4089 3rd Qu.:0.4168 3rd Qu.:0.2711 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
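The effect of scaling on Euclidean distance described above can be illustrated with a small sketch. The three observations below are made-up values (an assumption for illustration only), and `normalize()` repeats the min-max function defined earlier so that the block is self-contained.

```r
# Min-max scaling, as defined earlier in the report
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up observations (assumption): area spans hundreds, smoothness hundredths
toy <- data.frame(
  area       = c(400, 420, 1200),
  smoothness = c(0.05, 0.16, 0.06)
)
rownames(toy) <- c("a", "b", "c")

# Raw distances: the area column dominates
raw_dist <- as.matrix(dist(toy))

# Distances after scaling each column to [0, 1]
scaled <- as.data.frame(lapply(toy, normalize))
rownames(scaled) <- rownames(toy)
scaled_dist <- as.matrix(dist(scaled))

raw_dist["a", "b"]     # about 20: a and b look close because area differs little
scaled_dist["a", "b"]  # about 1: the large smoothness gap now counts as well
```

On the raw scale, observation a appears far closer to b than to c purely because of area; after scaling, both features influence the distance.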
To evaluate the performance and generalization ability of the machine learning model, the dataset was divided into two distinct subsets: a training set and a test set. The training set consisted of 469 observations, while the test set contained 100 observations.
The training set is used to fit the K-Nearest Neighbors (KNN) model, allowing it to learn patterns and relationships between tumor features and the corresponding diagnosis labels. In contrast, the test set is kept completely unseen during model training and is used solely for performance evaluation.
This separation is critical in machine learning because it provides an unbiased assessment of how well the model can generalize to new, unseen data. By evaluating the model on the test set, we can better estimate its real-world predictive performance and reduce the risk of overfitting, where a model performs well on training data but poorly on new observations.
# splitting the dataset into test and train
wdbc_train <- wdbc_n[1:469, ]
wdbc_test <- wdbc_n[470:569, ]
# splitting dependent variable into test and train
wdbc_train_label <- wdbc_data[1:469, 1]
wdbc_test_label <- wdbc_data[470:569, 1]
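A randomized split is one alternative to taking the first 469 rows. The sketch below is an assumption, not the split used in this report: it uses a toy data frame (`toy`) with the same class counts, an arbitrary seed, and the same 469/100 proportion, and it samples within each class so that both subsets keep a similar class balance.

```r
set.seed(123)  # arbitrary seed (assumption) so the split is reproducible

# Toy stand-in for the labelled data: 357 benign and 212 malignant rows
toy <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), times = c(357, 212))),
  feature   = runif(569)
)

train_fraction <- 469 / 569

# Sample row indices within each class (stratified sampling)
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$diagnosis),
  function(rows) sample(rows, size = round(length(rows) * train_fraction))
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should be close in both subsets
round(prop.table(table(toy_train$diagnosis)), 3)
round(prop.table(table(toy_test$diagnosis)), 3)
```

With the real data, the same indices would be used to subset both the normalized features and the diagnosis labels so that rows and labels stay aligned.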
The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method used for classification tasks. It is a non-parametric, instance-based learning approach, meaning it does not assume any underlying distribution of the data and does not build an explicit model during training. Instead, it stores the training data and performs classification based on similarity between observations.
In this project, KNN classifies each test sample by identifying the k closest observations from the training dataset and assigning the class that appears most frequently among those neighbors (majority voting principle). The similarity between observations is measured using Euclidean distance, which calculates the straight-line distance between data points in multidimensional feature space.
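The majority-vote procedure described above can be sketched from scratch for a single query point. This is a minimal illustration under stated assumptions: the training points and labels are made up, `knn_one()` is a hypothetical helper, and ties are not handled specially. The report itself relies on `knn()` from the class package.

```r
# Hypothetical helper: classify one query point by majority vote of k neighbours
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  distances <- sqrt(rowSums(sweep(as.matrix(train), 2, query)^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up, already scaled training data (assumption)
train <- data.frame(
  radius    = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
  concavity = c(0.05, 0.10, 0.08, 0.70, 0.75, 0.65)
)
labels <- factor(c("Benign", "Benign", "Benign",
                   "Malignant", "Malignant", "Malignant"))

knn_one(train, labels, query = c(0.82, 0.72), k = 3)  # "Malignant"
knn_one(train, labels, query = c(0.12, 0.07), k = 3)  # "Benign"
```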
A value of k = 21 was selected for the model. This parameter controls the number of neighbors considered during classification and plays a crucial role in balancing model bias and variance. A relatively larger k-value helps to reduce noise sensitivity and produces smoother decision boundaries, which is particularly useful in biomedical datasets where feature variability can be high.
The general form of the function call for this algorithm is: p <- knn(train = <training data>, test = <test data>, cl = <training labels>, k = <number of neighbors, here 21>)
# require library
library(class)
wdbc_prediction <- knn(train = wdbc_train, test = wdbc_test, cl = wdbc_train_label, k = 21)
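The role of k can be explored by fitting the classifier for several candidate values and comparing test accuracy. The sketch below is an illustration under stated assumptions rather than the procedure used to select k = 21: it uses simulated two-class data, an arbitrary seed, and an arbitrary grid of odd k values. One common rule of thumb places k near the square root of the training-set size; for 469 training rows that is about 21.7.

```r
library(class)

set.seed(1)  # arbitrary seed (assumption)

# Simulated two-class data (assumption): two overlapping Gaussian clusters
n <- 150
simulate <- function(n) {
  data.frame(
    x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
    x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2))
  )
}
train_x <- simulate(n)
test_x  <- simulate(n)
train_y <- factor(rep(c("Benign", "Malignant"), each = n))
test_y  <- factor(rep(c("Benign", "Malignant"), each = n))

# Test accuracy for each candidate k (odd values avoid two-class voting ties)
k_values <- c(1, 5, 11, 21, 31)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

data.frame(k = k_values, accuracy = round(accuracy, 3))
```

In practice the comparison would be run on a validation set or with cross-validation, so that the held-out test set is not used to tune k.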
Model performance was evaluated using a confusion matrix, which provides a detailed breakdown of correct and incorrect predictions made by the K-Nearest Neighbors (KNN) classifier. This evaluation approach is particularly important in medical classification problems, where the consequences of different types of errors are not equal.
# require library
library(gmodels)
CrossTable(x = wdbc_test_label, y = wdbc_prediction, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wdbc_prediction
## wdbc_test_label | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## ----------------|-----------|-----------|-----------|
##
##
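In this context, with malignant treated as the positive class, the confusion matrix is interpreted as follows:
1. True Positive (TP): a malignant tumor correctly predicted as malignant.
2. True Negative (TN): a benign tumor correctly predicted as benign.
3. False Positive (FP): a benign tumor incorrectly predicted as malignant.
4. False Negative (FN): a malignant tumor incorrectly predicted as benign.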
Among these outcomes, false negatives are especially critical in a clinical setting because they represent cases where a malignant tumor is mistakenly labeled as benign, potentially delaying diagnosis and treatment.
The confusion matrix obtained from the model shows a high number of correct classifications, indicating strong predictive performance. The model achieved excellent separation between benign and malignant cases, with only a small number of misclassifications. This suggests that the selected features, combined with proper normalization and the KNN algorithm, are highly effective for breast cancer classification.
Based on the confusion matrix obtained in this study:
1. True Positive (TP) = 21
2. True Negative (TN) = 77
3. False Positive (FP) = 0
4. False Negative (FN) = 2
Accuracy
Formula = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (21 + 77) / (21 + 77 + 0 + 2) = 98%
The model achieved an accuracy of 98%, meaning that it correctly classified 98 of the 100 tumors in the test set.
Specificity
Formula = TN / (TN + FP)
Specificity = 77 / (77 + 0) = 100%
The model achieved a specificity of 100%, meaning that all benign tumors in the test dataset were correctly classified without any false positive predictions.
Sensitivity
Formula = TP / (TP + FN)
Sensitivity = 21 / (21 + 2) = 91.3%
The sensitivity of the model was approximately 91.3%, meaning that 21 of the 23 malignant tumors in the test set were correctly detected and 2 were missed.
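The arithmetic above can be checked with a few lines of R. The counts are taken from the confusion matrix in this report; the variable names are chosen here for illustration.

```r
# Counts from the confusion matrix (malignant = positive class)
tp <- 21
tn <- 77
fp <- 0
fn <- 2

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.98
specificity <- tn / (tn + fp)                   # 1
sensitivity <- tp / (tp + fn)                   # 21 / 23, about 0.913

# Express the three metrics as percentages rounded to one decimal place
round(100 * c(accuracy = accuracy,
              specificity = specificity,
              sensitivity = sensitivity), 1)
```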
Overall, the evaluation metrics demonstrate that the KNN classifier exhibited excellent predictive capability for breast cancer diagnosis. The high accuracy, strong sensitivity, and perfect specificity suggest that the selected tumor features provide substantial discriminatory power for distinguishing between benign and malignant breast tumors.
This study demonstrated the effectiveness of machine learning techniques in the classification of breast tumors using morphological characteristics extracted from digitized breast tissue images. The exploratory data analysis revealed strong structural relationships among several tumor features, while the K-Nearest Neighbors (KNN) model achieved high classification performance after appropriate preprocessing and normalization.
One of the major findings from the exploratory analysis was the presence of strong positive correlations among size-related variables such as radius, perimeter, and area. Biologically, this relationship is expected because larger tumors generally possess greater perimeter measurements and occupy larger surface areas. Additionally, malignant tumors consistently exhibited higher values for features associated with irregular growth patterns, including concavity and compactness. These findings suggest that tumor morphology plays a significant role in distinguishing malignant tissue from benign tissue.
The pairwise visualization analysis further demonstrated clear class separability between benign and malignant tumors. Malignant samples tended to cluster around larger and more irregular feature values, while benign tumors were more concentrated within lower ranges. This visible separation indicates that the dataset contains highly informative features capable of supporting accurate predictive modeling.
The KNN algorithm performed well in this study, most likely because of the structure of the dataset and the nature of the extracted features. Since KNN is a distance-based algorithm, observations with similar tumor characteristics lie close together within the feature space. The relatively distinct separation between benign and malignant samples allowed the algorithm to classify unseen observations effectively using neighborhood similarity. Furthermore, normalization ensured that all variables contributed equally to distance calculations, preventing features with larger numerical scales from dominating the classification process; its effect on accuracy was not measured separately, since no model was fitted on unscaled data.
Despite the strong performance observed, several limitations should be acknowledged. First, the dataset is relatively small compared to large-scale clinical datasets commonly used in modern biomedical machine learning research. A larger and more diverse dataset could improve model generalizability and robustness. Second, the train-test split used in this project was sequential rather than randomized, which may introduce sampling bias and affect evaluation reliability. Randomized sampling or cross-validation techniques would provide a more reliable estimate of model performance. Finally, only a single machine learning algorithm (KNN) was explored. Although KNN achieved excellent results, comparing multiple classification models such as Random Forest, Support Vector Machine (SVM), Logistic Regression, or XGBoost could provide deeper insights into optimal predictive performance for this dataset.
Overall, the findings from this project highlight the strong potential of machine learning approaches in supporting breast cancer diagnosis. The combination of data cleaning, exploratory data analysis, normalization, and predictive modeling demonstrates how computational techniques can assist in identifying clinically relevant patterns within biomedical datasets.
In conclusion, this project demonstrated that morphological tumor features can effectively distinguish benign from malignant breast tumors using machine learning techniques. Exploratory analysis revealed strong inter-feature relationships and clear class separability, while the KNN classifier achieved high predictive accuracy after feature normalization.
The clean dataset
View(wdbc_data[100, ])
Summary of the clean data
summary(wdbc_data)
## diagnosis radius_mean texture_mean perimeter_mean
## Benign :357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## Malignant:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave_points_mean symmetry_mean fractal_dimension_mean radius_se
## Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
## Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242
## Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
## Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730
## texture_se perimeter_se area_se smoothness_se
## Min. :0.3602 Min. : 0.757 Min. : 6.802 Min. :0.001713
## 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850 1st Qu.:0.005169
## Median :1.1080 Median : 2.287 Median : 24.530 Median :0.006380
## Mean :1.2169 Mean : 2.866 Mean : 40.337 Mean :0.007041
## 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190 3rd Qu.:0.008146
## Max. :4.8850 Max. :21.980 Max. :542.200 Max. :0.031130
## compactness_se concavity_se concave_points_se symmetry_se
## Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882
## 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160
## Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730
## Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542
## 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480
## Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.0008948 Min. : 7.93 Min. :12.02 Min. : 50.41
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11
## Median :0.0031870 Median :14.97 Median :25.41 Median : 97.66
## Mean :0.0037949 Mean :16.27 Mean :25.68 Mean :107.26
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40
## Max. :0.0298400 Max. :36.04 Max. :49.54 Max. :251.20
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.0000
## 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145
## Median : 686.5 Median :0.13130 Median :0.21190 Median :0.2267
## Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.2722
## 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829
## Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.2520
## concave_points_worst symmetry_worst fractal_dimension_worst
## Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.29100 Max. :0.6638 Max. :0.20750
I dedicate this project to my beloved mother, whose endless love, prayers, sacrifices, and unwavering support have been a source of strength throughout my academic journey. Your care and encouragement continue to inspire me to strive for excellence in all that I do.
To my father, who has remained a strong pillar in my life, thank you for your guidance, support, discipline, and constant belief in my potential. Your efforts and sacrifices have laid the foundation for my growth and success.
I also dedicate this work to my darling girlfriend, Zahra Ayunie💕❤️, an amazing and supportive soul who has stood by me through thick and thin. Thank you for your love, encouragement, advice, and for sharing dreams and aspirations with me. Your presence has brought motivation, comfort, and happiness into my journey. I pray that Allah grants us success and makes our dreams come to pass, Insha’Allah.
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22621)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Africa/Lagos
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gmodels_2.19.1 class_7.3-23 patchwork_1.3.2 reshape2_1.4.5
## [5] corrplot_0.95 GGally_2.4.0 skimr_2.2.2 janitor_2.2.1
## [9] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.0
## [13] purrr_1.2.1 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
## [17] ggplot2_4.0.2 tidyverse_2.0.0 rio_1.2.4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 gtools_3.9.5
## [4] stringi_1.8.7 hms_1.1.4 digest_0.6.39
## [7] magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
## [10] timechange_0.4.0 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] plyr_1.8.9 R.oo_1.27.1 jsonlite_2.0.0
## [16] R.utils_2.13.0 scales_1.4.0 jquerylib_0.1.4
## [19] cli_3.6.5 rlang_1.1.7 R.methodsS3_1.8.2
## [22] base64enc_0.1-6 withr_3.0.2 repr_1.1.7
## [25] cachem_1.1.0 yaml_2.3.12 tools_4.5.2
## [28] tzdb_0.5.0 ggstats_0.13.0 vctrs_0.7.1
## [31] R6_2.6.1 lifecycle_1.0.5 snakecase_0.11.1
## [34] MASS_7.3-65 pkgconfig_2.0.3 pillar_1.11.1
## [37] bslib_0.10.0 gtable_0.3.6 Rcpp_1.1.1
## [40] glue_1.8.0 data.table_1.18.2.1 xfun_0.56
## [43] tidyselect_1.2.1 rstudioapi_0.18.0 knitr_1.51
## [46] farver_2.1.2 htmltools_0.5.9 gdata_3.0.1
## [49] labeling_0.4.3 rmarkdown_2.30 compiler_4.5.2
## [52] S7_0.2.1