Understanding how items relate to one another matters in many fields. For instance, advertisers look for commonalities among their target audience to tailor their promotional efforts, while taxonomists arrange flora based on shared attributes. Clustering algorithms provide a way of grouping such items. In this project, we examine how well unsupervised clustering algorithms can help medical practitioners determine appropriate treatments for their patients.
We will use clustering algorithms to group the confidential information of individuals diagnosed with breast cancer. This approach may reveal correlations between patient characteristics and treatment outcomes, which could help medical practitioners in their treatment decisions. The data we use is sourced from the Portuguese Foundation for Science and Technology.
For the analysis to succeed, we first need to understand the structure of the data. The clustering algorithms used in this project require numerical data, so we will verify that every column is numeric.
Load the necessary libraries
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.9
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(ggplot2)
Let’s load the data
breast <- read_csv("~/Desktop/Seto/UL/Project/Clustering/dataR2.csv")
## Rows: 116 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, M...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Print the first 6 observations of the dataset
head(breast)
## # A tibble: 6 × 10
## Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin MCP.1 Classifi…¹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 48 23.5 70 2.71 0.467 8.81 9.70 8.00 417. 1
## 2 83 20.7 92 3.12 0.707 8.84 5.43 4.06 469. 1
## 3 82 23.1 91 4.50 1.01 17.9 22.4 9.28 555. 1
## 4 68 21.4 77 3.23 0.613 9.88 7.17 12.8 928. 1
## 5 86 21.1 92 3.55 0.805 6.70 4.82 10.6 774. 1
## 6 49 22.9 92 3.23 0.732 6.83 13.7 10.3 530. 1
## # … with abbreviated variable name ¹Classification
There are 10 variables within the dataset.
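The column specification reported by read_csv() already lists all ten columns as dbl, but the numeric requirement can also be checked directly. A minimal sketch, assuming a small made-up data frame `toy` in place of the patient data (the column names mirror the real file; the values are illustrative only):

```r
# Toy stand-in for the patient data (illustrative values only)
toy <- data.frame(
  Age     = c(48, 83, 82),
  BMI     = c(23.5, 20.7, 23.1),
  Glucose = c(70, 92, 91)
)

# TRUE for each column that is stored as a numeric type
is_num <- vapply(toy, is.numeric, logical(1))
all_numeric <- all(is_num)

# kmeans() does not accept missing values, so count them as well
n_missing <- sum(is.na(toy))

print(is_num)
print(all_numeric)
print(n_missing)
```

The same two checks could be run on `breast` after loading it.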
Since we are going to re-cluster the dataset using a clustering algorithm, let’s omit the Classification column
breast <- breast[,1:9]
Now we have omitted the label of our dataset.
We will conduct Exploratory Data Analysis to familiarize ourselves with the data before performing the clustering algorithm.
Let’s check the summary statistics to see if we should scale the dataset
summary(breast)
## Age BMI Glucose Insulin
## Min. :24.0 Min. :18.37 Min. : 60.00 Min. : 2.432
## 1st Qu.:45.0 1st Qu.:22.97 1st Qu.: 85.75 1st Qu.: 4.359
## Median :56.0 Median :27.66 Median : 92.00 Median : 5.925
## Mean :57.3 Mean :27.58 Mean : 97.79 Mean :10.012
## 3rd Qu.:71.0 3rd Qu.:31.24 3rd Qu.:102.00 3rd Qu.:11.189
## Max. :89.0 Max. :38.58 Max. :201.00 Max. :58.460
## HOMA Leptin Adiponectin Resistin
## Min. : 0.4674 Min. : 4.311 Min. : 1.656 Min. : 3.210
## 1st Qu.: 0.9180 1st Qu.:12.314 1st Qu.: 5.474 1st Qu.: 6.882
## Median : 1.3809 Median :20.271 Median : 8.353 Median :10.828
## Mean : 2.6950 Mean :26.615 Mean :10.181 Mean :14.726
## 3rd Qu.: 2.8578 3rd Qu.:37.378 3rd Qu.:11.816 3rd Qu.:17.755
## Max. :25.0503 Max. :90.280 Max. :38.040 Max. :82.100
## MCP.1
## Min. : 45.84
## 1st Qu.: 269.98
## Median : 471.32
## Mean : 534.65
## 3rd Qu.: 700.09
## Max. :1698.44
Some variables differ greatly in range (for example, MCP.1 is in the hundreds while HOMA is mostly in the single digits), so we should scale the data to put every feature on the same footing, with none given more weight up front.
breast_scaled <- scale(breast, scale = TRUE)
Let’s look at the data after scaling
summary(breast_scaled)
## Age BMI Glucose Insulin
## Min. :-2.06679 Min. :-1.8350 Min. :-1.6778 Min. :-0.7529
## 1st Qu.:-0.76348 1st Qu.:-0.9181 1st Qu.:-0.5347 1st Qu.:-0.5615
## Median :-0.08079 Median : 0.0160 Median :-0.2572 Median :-0.4060
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.85015 3rd Qu.: 0.7289 3rd Qu.: 0.1868 3rd Qu.: 0.1169
## Max. : 1.96728 Max. : 2.1905 Max. : 4.5818 Max. : 4.8122
## HOMA Leptin Adiponectin Resistin
## Min. :-0.6116 Min. :-1.1627 Min. :-1.2457 Min. :-0.9294
## 1st Qu.:-0.4879 1st Qu.:-0.7455 1st Qu.:-0.6878 1st Qu.:-0.6331
## Median :-0.3608 Median :-0.3307 Median :-0.2671 Median :-0.3146
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.0447 3rd Qu.: 0.5611 3rd Qu.: 0.2389 3rd Qu.: 0.2445
## Max. : 6.1381 Max. : 3.3188 Max. : 4.0710 Max. : 5.4375
## MCP.1
## Min. :-1.4131
## 1st Qu.:-0.7651
## Median :-0.1831
## Mean : 0.0000
## 3rd Qu.: 0.4783
## Max. : 3.3644
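With its default arguments, scale() subtracts each column's mean and divides by its standard deviation, which is why every mean in the scaled summary is 0. A small sketch of that calculation on a made-up vector (the values in `x` are illustrative, not a column from the dataset):

```r
# Illustrative values on a scale similar to MCP.1
x <- c(45.84, 269.98, 471.32, 700.09, 1698.44)

# Standardize by hand: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the attributes
auto <- as.numeric(scale(x))

print(round(manual, 4))
print(isTRUE(all.equal(manual, auto)))
```

After standardizing, the values have mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances used by the clustering algorithms.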
With the scaled data, we can start to perform the clustering algorithm.
We will apply two clustering algorithms in this project: K-Means and hierarchical clustering.
K-Means clustering is one of the most popular unsupervised machine learning algorithms.
Set the seed so that our results are reproducible
seed_value <- 1234
set.seed(seed_value)
In this algorithm, we must specify the number of clusters (k)
k <- 4
Let’s run the k-means algorithm
first_cluster <- kmeans(breast_scaled, centers = k, nstart = 1)
Let’s see how many patients are in each cluster
first_cluster$size
## [1] 42 5 25 44
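With nstart = 1, the result depends on a single random choice of starting centers, which is one reason the seed matters here. A hedged sketch of how one start could be compared with several, using synthetic data rather than the patient data (the `toy` matrix and its four groups are assumptions made for illustration):

```r
# Synthetic two-column data with four loose groups (illustrative values only)
set.seed(1234)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 3), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 9), ncol = 2)
)

# One random start
set.seed(1)
single <- kmeans(toy, centers = 4, nstart = 1)

# 25 random starts; kmeans() keeps the run with the lowest
# total within-cluster sum of squares
set.seed(1)
multi <- kmeans(toy, centers = 4, nstart = 25)

print(c(single = single$tot.withinss, multi = multi$tot.withinss))
print(multi$size)
```

A lower tot.withinss means tighter clusters, so a larger nstart may give a more stable partition at the cost of extra computation.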
seed_value <- 12345
set.seed(seed_value)
k <- 2
second_cluster <- kmeans(breast_scaled, centers = k, nstart = 1)
second_cluster$size
## [1] 75 41
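The two runs use k = 4 and k = 2. One common heuristic for comparing candidate values of k, which is not part of this analysis, is the elbow method: fit k-means for a range of k and look for the point where the total within-cluster sum of squares stops dropping sharply. A minimal sketch on synthetic data (the `toy` matrix, the range 1 to 6, and nstart = 10 are all assumptions made for illustration):

```r
# Synthetic data with two well-separated groups (illustrative values only)
set.seed(1234)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
ks <- 1:6
wss <- vapply(
  ks,
  function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss,
  numeric(1)
)

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting `wss` against `ks` would show the bend; the heuristic is subjective, so it is best read alongside domain knowledge rather than as a rule.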
Let’s add the cluster assignments to the data
breast["first_cluster"] <- first_cluster$cluster
breast["second_cluster"] <- second_cluster$cluster
To see the result more clearly, let’s plot Age against Insulin, colored by the first and then the second cluster assignment.
first_plot <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(first_cluster))) + geom_point()
first_plot
second_plot <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(second_cluster))) + geom_point()
second_plot
Complete linkage
first_hier_cluster <- hclust(dist(breast_scaled),method = "complete")
plot(first_hier_cluster)
hc1_assign <- cutree(first_hier_cluster, 3)
Single linkage
second_hier_cluster <- hclust(dist(breast_scaled),method = "single")
plot(second_hier_cluster)
hc2_assign <- cutree(second_hier_cluster, 3)
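Complete and single linkage can cut the same data into very different groups: single linkage tends to chain points together, which often leaves one large cluster and a few very small ones. One simple way to compare them is to tabulate the cluster sizes after cutting. A sketch on synthetic data (the `toy` matrix and its three groups are assumptions made for illustration, not the patient data):

```r
# Synthetic data with three groups (illustrative values only)
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Same Euclidean distance matrix for both linkages
d <- dist(toy)

# Cut each tree into three clusters
complete_assign <- cutree(hclust(d, method = "complete"), k = 3)
single_assign   <- cutree(hclust(d, method = "single"), k = 3)

# Cluster sizes under each linkage
print(table(complete_assign))
print(table(single_assign))
```

Running table(hc1_assign) and table(hc2_assign) on the real assignments would give the same kind of comparison.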
Add the assignments from the chosen hierarchical linkage (complete)
breast["hc_cluster"] <- hc1_assign
hd_simple <- breast[,!(names(breast) %in% c("first_cluster","second_cluster"))]
cluster_summary <- do.call(data.frame, aggregate(. ~ hc_cluster, data = hd_simple, function(x) c(avg = mean(x), sd = sd(x))))
cluster_summary
## hc_cluster Age.avg Age.sd BMI.avg BMI.sd Glucose.avg Glucose.sd
## 1 1 55.87619 15.839760 27.29589 5.055650 94.28571 15.280527
## 2 2 70.12500 8.919281 30.98131 4.220091 106.00000 15.565530
## 3 3 73.00000 21.656408 28.53515 2.406033 198.66667 2.516611
## Insulin.avg Insulin.sd HOMA.avg HOMA.sd Leptin.avg Leptin.sd
## 1 9.118981 9.091924 2.232642 2.491134 22.77857 14.29951
## 2 12.290250 5.122041 3.317516 1.784610 69.86734 17.19186
## 3 35.195667 20.589746 17.216999 9.987783 45.55360 26.43773
## Adiponectin.avg Adiponectin.sd Resistin.avg Resistin.sd MCP.1.avg MCP.1.sd
## 1 10.527127 7.035014 13.31998 8.713109 525.9755 308.0576
## 2 7.283759 3.652193 22.04973 28.976806 289.5481 238.4204
## 3 5.787642 1.935047 44.40540 17.369156 1491.7463 358.0039
Let’s plot Age and Insulin
plot1 <- ggplot(breast, aes(x=Age, y=Insulin, color=as.factor(hc_cluster))) + geom_point()
plot1
Let’s plot Adiponectin and MCP.1
plot2 <- ggplot(breast, aes(x=Adiponectin, y=MCP.1, color=as.factor(hc_cluster))) + geom_point()
plot2
Based on accuracy, the best choice of algorithm for this dataset turned out to be K-Means with k = 2.
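One way such an accuracy could be computed is to compare cluster assignments with the original Classification labels, assuming those labels were saved before the column was dropped. Cluster numbers are arbitrary, so with two clusters and two classes both possible mappings have to be tried. The sketch below uses hypothetical labels and assignments for eight patients, not results from this analysis:

```r
# Hypothetical true labels and cluster assignments (illustrative only)
truth   <- c(1, 1, 1, 1, 2, 2, 2, 2)
cluster <- c(2, 2, 2, 1, 1, 1, 1, 1)

# Cross-tabulate clusters against the true labels
confusion <- table(cluster = cluster, truth = truth)
print(confusion)

# Cluster IDs carry no meaning, so score both mappings
acc_direct  <- mean(cluster == truth)        # cluster 1 -> class 1
acc_flipped <- mean((3 - cluster) == truth)  # cluster 1 -> class 2

# Keep the mapping that agrees best with the labels
accuracy <- max(acc_direct, acc_flipped)
print(accuracy)
```

With more than two clusters, each cluster would need to be matched to a class (for example by majority vote), so this two-way flip only applies to the k = 2 case.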
Hayden, L. (2018, August). Principal Component Analysis in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/pca-analysis-r
IBM. (2020). Kaiser-Meyer-Olkin measure for identity correlation matrix. https://www.ibm.com/support/pages/kaiser-meyer-olkin-measure-identity-correlation-matrix
Jaiswal, S. (2018, March). K-Means Clustering in R Tutorial. Datacamp. https://www.datacamp.com/tutorial/k-means-clustering-r
Patrício, M., Pereira, J., Crisóstomo, J., et al. (2018). Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer, 18, 29. https://doi.org/10.1186/s12885-017-3877-1
Roy, B. (2020, April). All about Feature Scaling. Towards Data Science. https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
UCI Machine Learning Repository. (2018). Breast Cancer Coimbra Data Set. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra