Clustering Breast Cancer Coimbra Data Set

1. Targeting treatment for breast cancer patients

Comprehending the way elements associate with one another is crucial in a multitude of fields. For instance, advertisers aim to recognize commonalities among their target audience to tailor their promotional efforts, while taxonomists arrange flora based on shared attributes. Clustering algorithms provide a method of grouping elements. In this project, we will examine the effectiveness of unsupervised clustering algorithms in assisting medical practitioners to determine appropriate remedies for their patients.

We will use clustering algorithms to group clinical data from breast cancer patients and healthy controls. This approach may reveal correlations between patient characteristics and treatment outcomes, which could help medical practitioners with their treatment decisions. The data we will be using is sourced from the Portuguese Foundation for Science and Technology.

To ensure a successful analysis, it is imperative to familiarize ourselves with the data’s structure. The clustering algorithms utilized in this project necessitate numerical data, thus we will verify that all data points meet this requirement.

Load the necessary libraries

library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ dplyr   1.0.9
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ purrr   0.3.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidyr)
library(ggplot2)

Let’s load the data

breast <- read_csv("~/Desktop/Seto/UL/Project/Clustering/dataR2.csv")
## Rows: 116 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, M...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Print the first six observations of the dataset (the default for head())

head(breast)
## # A tibble: 6 × 10
##     Age   BMI Glucose Insulin  HOMA Leptin Adiponectin Resistin MCP.1 Classifi…¹
##   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl>       <dbl>    <dbl> <dbl>      <dbl>
## 1    48  23.5      70    2.71 0.467   8.81        9.70     8.00  417.          1
## 2    83  20.7      92    3.12 0.707   8.84        5.43     4.06  469.          1
## 3    82  23.1      91    4.50 1.01   17.9        22.4      9.28  555.          1
## 4    68  21.4      77    3.23 0.613   9.88        7.17    12.8   928.          1
## 5    86  21.1      92    3.55 0.805   6.70        4.82    10.6   774.          1
## 6    49  22.9      92    3.23 0.732   6.83       13.7     10.3   530.          1
## # … with abbreviated variable name ¹​Classification

There are 10 variables within the dataset.

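  • Age (years)
  • BMI (kg/m²)
  • Glucose (mg/dL)
  • Insulin (µU/mL)
  • HOMA (homeostatic model assessment of insulin resistance)
  • Leptin (ng/mL)
  • Adiponectin (µg/mL)
  • Resistin (ng/mL)
  • MCP-1 (pg/dL)
  • Classification (1 = healthy control, 2 = patient, according to the UCI repository description)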
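
Because the clustering algorithms require numeric input, it can help to check the column types explicitly rather than reading them off the printed tibble. The following is a minimal sketch on a hypothetical stand-in data frame called `demo` (a few values copied from the rows printed above); the same calls should work on `breast`.

```r
# Hypothetical stand-in for the `breast` tibble (values copied from the first rows above)
demo <- data.frame(
  Age = c(48, 83, 82),
  BMI = c(23.5, 20.7, 23.1),
  Glucose = c(70, 92, 91),
  Classification = c(1, 1, 1)
)

# TRUE for each column that holds numeric values
is_num <- vapply(demo, is.numeric, logical(1))
print(is_num)

# Stop early if any column would need recoding before clustering
stopifnot(all(is_num))

# Count missing values per column; kmeans() does not accept missing values
na_counts <- colSums(is.na(demo))
print(na_counts)
```

If a column turned out to be character or factor, it would need to be recoded or dropped before scaling.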

Since we are going to re-cluster the dataset using clustering algorithms, let’s omit the Classification column

breast <- breast[,1:9]

Now we have omitted the label of our dataset.

2. Measuring patient differences

We will conduct exploratory data analysis to familiarize ourselves with the data before applying the clustering algorithms.

Let’s check the summary statistics to see if we should scale the dataset

summary(breast)
##       Age            BMI           Glucose          Insulin      
##  Min.   :24.0   Min.   :18.37   Min.   : 60.00   Min.   : 2.432  
##  1st Qu.:45.0   1st Qu.:22.97   1st Qu.: 85.75   1st Qu.: 4.359  
##  Median :56.0   Median :27.66   Median : 92.00   Median : 5.925  
##  Mean   :57.3   Mean   :27.58   Mean   : 97.79   Mean   :10.012  
##  3rd Qu.:71.0   3rd Qu.:31.24   3rd Qu.:102.00   3rd Qu.:11.189  
##  Max.   :89.0   Max.   :38.58   Max.   :201.00   Max.   :58.460  
##       HOMA             Leptin        Adiponectin        Resistin     
##  Min.   : 0.4674   Min.   : 4.311   Min.   : 1.656   Min.   : 3.210  
##  1st Qu.: 0.9180   1st Qu.:12.314   1st Qu.: 5.474   1st Qu.: 6.882  
##  Median : 1.3809   Median :20.271   Median : 8.353   Median :10.828  
##  Mean   : 2.6950   Mean   :26.615   Mean   :10.181   Mean   :14.726  
##  3rd Qu.: 2.8578   3rd Qu.:37.378   3rd Qu.:11.816   3rd Qu.:17.755  
##  Max.   :25.0503   Max.   :90.280   Max.   :38.040   Max.   :82.100  
##      MCP.1        
##  Min.   :  45.84  
##  1st Qu.: 269.98  
##  Median : 471.32  
##  Mean   : 534.65  
##  3rd Qu.: 700.09  
##  Max.   :1698.44

The ranges of the variables differ greatly (for example, MCP.1 ranges in the hundreds while HOMA ranges in the ones). Because both clustering algorithms rely on distances between observations, variables with large ranges would dominate the result. We should therefore scale the data to put every feature on the same footing, without any upfront importance.

breast_scaled <- scale(breast, scale = TRUE)

Let’s look at the data after scaling

summary(breast_scaled)
##       Age                BMI             Glucose           Insulin       
##  Min.   :-2.06679   Min.   :-1.8350   Min.   :-1.6778   Min.   :-0.7529  
##  1st Qu.:-0.76348   1st Qu.:-0.9181   1st Qu.:-0.5347   1st Qu.:-0.5615  
##  Median :-0.08079   Median : 0.0160   Median :-0.2572   Median :-0.4060  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.85015   3rd Qu.: 0.7289   3rd Qu.: 0.1868   3rd Qu.: 0.1169  
##  Max.   : 1.96728   Max.   : 2.1905   Max.   : 4.5818   Max.   : 4.8122  
##       HOMA             Leptin         Adiponectin         Resistin      
##  Min.   :-0.6116   Min.   :-1.1627   Min.   :-1.2457   Min.   :-0.9294  
##  1st Qu.:-0.4879   1st Qu.:-0.7455   1st Qu.:-0.6878   1st Qu.:-0.6331  
##  Median :-0.3608   Median :-0.3307   Median :-0.2671   Median :-0.3146  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.0447   3rd Qu.: 0.5611   3rd Qu.: 0.2389   3rd Qu.: 0.2445  
##  Max.   : 6.1381   Max.   : 3.3188   Max.   : 4.0710   Max.   : 5.4375  
##      MCP.1        
##  Min.   :-1.4131  
##  1st Qu.:-0.7651  
##  Median :-0.1831  
##  Mean   : 0.0000  
##  3rd Qu.: 0.4783  
##  Max.   : 3.3644
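
To make the scaling step concrete, the sketch below reproduces what scale() does by hand on a small matrix with hypothetical values: each column has its mean subtracted and is then divided by its standard deviation. The object names `x`, `manual` and `auto` are illustrative only.

```r
# Toy matrix with two columns on very different scales (hypothetical values)
x <- cbind(HOMA  = c(0.5, 1.0, 1.5, 3.0),
           MCP.1 = c(100, 400, 700, 1600))

# Manual standardisation: subtract the column mean, divide by the column sd
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with its defaults (center = TRUE, scale = TRUE) does the same
auto <- scale(x)

# Compare values only; scale() also attaches centring/scaling attributes
print(all.equal(manual, auto, check.attributes = FALSE))

# After scaling, each column has mean 0 and standard deviation 1
print(round(colMeans(auto), 10))
print(apply(auto, 2, sd))
```

This matches the summary above, where every scaled column has a mean of 0.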

3. Let’s start grouping the patients and healthy controls

With the scaled data, we can start applying the clustering algorithms.

We will apply two clustering algorithms in this project:

  1. K-Means Clustering
  2. Hierarchical Clustering

4. K-Means Clustering

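K-Means clustering is one of the most popular unsupervised machine learning algorithms. It partitions the observations into k clusters by assigning each observation to the nearest cluster centre and then recomputing the centres until the assignments stop changing.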

Set the seed so that our results are reproducible

seed_value <- 1234
set.seed(seed_value)

For this algorithm, we have to specify the number of clusters (k) in advance

k <- 4

Let’s run the k-means algorithm

first_cluster <- kmeans(breast_scaled, centers = k, nstart = 1)

Let’s see how many patients are in each cluster

first_cluster$size
## [1] 42  5 25 44

Another round of K-Means

seed_value <- 12345
set.seed(seed_value)

k <- 2

second_cluster <- kmeans(breast_scaled, centers = k, nstart = 1)

second_cluster$size
## [1] 75 41
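
The two runs above fix k by hand. One common heuristic for choosing k, which is not part of the analysis above and is shown here only as a sketch, is the elbow method: run k-means for several values of k and plot the total within-cluster sum of squares. The data `toy` is a synthetic stand-in for `breast_scaled`, and the seed and `nstart = 20` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch

# Synthetic stand-in for `breast_scaled`: three well-separated groups in 2-D
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6.
# nstart = 20 keeps the best of 20 random starts, which makes the
# result less sensitive to the seed than nstart = 1.
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Look for the "elbow": the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Applied to `breast_scaled`, the same loop would give a rough indication of whether k = 2 or k = 4 is the more natural choice.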

Comparing patient clusters

Let’s add the cluster assignments to the data

breast["first_cluster"] <- first_cluster$cluster
breast["second_cluster"] <- second_cluster$cluster

Now, to see it clearly, let’s plot the data with the first and second cluster.

first_plot <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(first_cluster))) + geom_point()

first_plot

second_plot <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(second_cluster))) + geom_point()

second_plot
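
Besides the scatter plots, a cross-tabulation shows how the clusters of the two runs overlap. The sketch below uses hypothetical label vectors (`first_labels`, `second_labels`) in place of `first_cluster$cluster` and `second_cluster$cluster`.

```r
# Hypothetical cluster labels for eight patients from two k-means runs
first_labels  <- c(1, 1, 2, 3, 3, 4, 4, 4)
second_labels <- c(1, 1, 1, 2, 2, 2, 2, 1)

# Rows: clusters from the first run; columns: clusters from the second run
overlap <- table(first = first_labels, second = second_labels)
print(overlap)
```

A row with most of its count in a single column means that cluster from the first run is largely contained in one cluster of the second run.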

5. Hierarchical Clustering

Complete linkage

first_hier_cluster <- hclust(dist(breast_scaled),method = "complete")

plot(first_hier_cluster)

hc1_assign <- cutree(first_hier_cluster, 3)

Single linkage

second_hier_cluster <- hclust(dist(breast_scaled),method = "single")

plot(second_hier_cluster)

hc2_assign <- cutree(second_hier_cluster, 3)
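
Before choosing between the two linkages, it can be useful to compare the cluster sizes they produce, since single linkage typically tends to chain observations into one large cluster with a few near-singletons. The following is a minimal sketch on synthetic data (`toy` stands in for `breast_scaled`; the seed is arbitrary); on the real data the same comparison would be `table(hc1_assign)` and `table(hc2_assign)`.

```r
set.seed(2)  # arbitrary seed for this sketch

# Synthetic stand-in for `breast_scaled`: 50 observations, 2 variables
toy <- matrix(rnorm(100), ncol = 2)
d <- dist(toy)  # Euclidean distances, the default of dist()

# Same data, same number of clusters, two linkage methods
complete_assign <- cutree(hclust(d, method = "complete"), k = 3)
single_assign   <- cutree(hclust(d, method = "single"), k = 3)

# Cluster sizes under each linkage
print(table(complete_assign))
print(table(single_assign))
```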

7. Comparing Clustering Results

Add the assignments from the chosen hierarchical linkage (complete linkage)

breast["hc_cluster"] <- hc1_assign

hd_simple <- breast[,!(names(breast) %in% c("first_cluster","second_cluster"))]

cluster_summary <- do.call(data.frame, aggregate(. ~ hc_cluster, data = hd_simple, function(x) c(avg = mean(x), sd = sd(x))))

cluster_summary
##   hc_cluster  Age.avg    Age.sd  BMI.avg   BMI.sd Glucose.avg Glucose.sd
## 1          1 55.87619 15.839760 27.29589 5.055650    94.28571  15.280527
## 2          2 70.12500  8.919281 30.98131 4.220091   106.00000  15.565530
## 3          3 73.00000 21.656408 28.53515 2.406033   198.66667   2.516611
##   Insulin.avg Insulin.sd  HOMA.avg  HOMA.sd Leptin.avg Leptin.sd
## 1    9.118981   9.091924  2.232642 2.491134   22.77857  14.29951
## 2   12.290250   5.122041  3.317516 1.784610   69.86734  17.19186
## 3   35.195667  20.589746 17.216999 9.987783   45.55360  26.43773
##   Adiponectin.avg Adiponectin.sd Resistin.avg Resistin.sd MCP.1.avg MCP.1.sd
## 1       10.527127       7.035014     13.31998    8.713109  525.9755 308.0576
## 2        7.283759       3.652193     22.04973   28.976806  289.5481 238.4204
## 3        5.787642       1.935047     44.40540   17.369156 1491.7463 358.0039
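
The `do.call(data.frame, ...)` wrapper in the summary call is easy to overlook. When the summary function returns two values, aggregate() stores them in a single matrix-valued column per variable; `do.call(data.frame, ...)` expands these into the separate `.avg` and `.sd` columns seen above. A small sketch with hypothetical toy data:

```r
# Hypothetical toy data: one measurement, two clusters
toy <- data.frame(
  hc_cluster = c(1, 1, 2, 2),
  Age = c(40, 50, 60, 80)
)

# aggregate() stores the two statistics as ONE matrix-valued column named "Age"
nested <- aggregate(. ~ hc_cluster, data = toy,
                    FUN = function(x) c(avg = mean(x), sd = sd(x)))
print(ncol(nested))   # 2: hc_cluster and the matrix column Age

# do.call(data.frame, ...) expands the matrix into separate columns
flat <- do.call(data.frame, nested)
print(names(flat))    # "hc_cluster" "Age.avg" "Age.sd"
```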

8. Visualizing Cluster Contents

Let’s plot Age and Insulin

plot1 <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(hc_cluster))) + geom_point()

plot1

Let’s plot Adiponectin and MCP.1

plot2 <- ggplot(breast, aes(x=Adiponectin, y=MCP.1, color=as.factor(hc_cluster))) + geom_point()

plot2

9. Conclusion

Based on accuracy, the best choice of algorithm for this dataset turned out to be K-Means with k = 2.
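
The accuracy computation itself is not shown in this report. One possible way to obtain such a figure for k = 2, given here as a sketch under that assumption, is to compare the cluster assignments with the Classification column that was set aside earlier. Because cluster numbers are arbitrary, both possible label mappings have to be considered. The vectors `true_labels` and `cluster_ids` are hypothetical toy values, not results from the data.

```r
# Hypothetical labels: 1 = healthy control, 2 = patient (toy values, not the real data)
true_labels <- c(1, 1, 1, 2, 2, 2, 2, 1)
# Hypothetical k-means output; cluster numbers are arbitrary
cluster_ids <- c(2, 2, 1, 1, 1, 1, 2, 2)

# Share of observations where cluster id and label coincide as given
agree <- mean(cluster_ids == true_labels)

# With k = 2 the only other mapping swaps the two cluster numbers,
# so the accuracy is the better of the two mappings
accuracy <- max(agree, 1 - agree)
print(accuracy)   # 0.75
```

For more than two clusters, the swap trick no longer applies and each cluster would have to be matched to a label explicitly, for example by majority vote within the cluster.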

10. Bibliography

Hayden, L. (2018, August). Principal Component Analysis in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/pca-analysis-r

IBM. (2020). Kaiser-Meyer-Olkin measure for identity correlation matrix. https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix

Jaiswal, S. (2018, March). K-Means Clustering in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/k-means-clustering-r

Patrício, M., Pereira, J., Crisóstomo, J., et al. (2018). Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18, 29. https://doi.org/10.1186/s12885-017-3877-1

Roy, B. (2020, April). All about Feature Scaling. Towards Data Science. https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35

UCI Machine Learning Repository. (2018). Breast Cancer Coimbra Data Set. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra