Glioblastoma is the most deadly type of brain cancer. Usually affects 50 and 70 year old population, and it is one of the fastest growing cancer since it affects the brain glia cells (often fast-growing and invasive). This time we will try to create a clustering and hierarchichal classification of patients.
glio
We will use the public database from Columbia University:
| molecular_subtype | extent_of_resection | surgery | mgmt_methylation | survival_days | egfr_amplification | initial_kps | age_in_years |
|---|---|---|---|---|---|---|---|
| Proneural | Complete | primary | Yes | NA | No | 100 | 44 yrs |
| Classical, Mesenchymal | Complete | primary | Yes | 1076 | No | 100 | 57 yrs |
| Classical | Complete | primary | No | 105 | Yes | 100 | 66 yrs |
| Classical | Sub-total | primary | No | 80 | Yes | 70 | 61 yrs |
| Mesenchymal | Complete | primary | No | 250 | Yes | 90 | 59 yrs |
| Neural, Proneural | Complete | primary | No | 353 | Yes | 100 | 76 yrs |
| Sub-total | primary | Yes | 903 | No | 90 | 36 yrs | |
| Proneural | Complete | primary | No | 615 | No | 80 | 67 yrs |
| Mesenchymal | Complete | recurrent | No | 363 | No | 90 | 43 yrs |
| Classical, Neural | Complete | primary | Yes | 1096 | Yes | 90 | 64 yrs |
dim(tumor)
## [1] 42 10
str(tumor)
## 'data.frame': 42 obs. of 10 variables:
## $ donor_id : int 12111 12112 10865 12165 12877 12996 159992499 159994896 277805907 10926 ...
## $ tumor_name : Factor w/ 42 levels "W1-1-2","W10-1-1",..: 2 3 1 4 5 6 7 8 10 9 ...
## $ molecular_subtype : Factor w/ 9 levels "","Classical",..: 9 3 2 2 5 8 1 9 5 4 ...
## $ extent_of_resection: Factor w/ 2 levels "Complete","Sub-total": 1 1 1 2 1 1 2 1 1 1 ...
## $ surgery : Factor w/ 2 levels "primary","recurrent": 1 1 1 1 1 1 1 1 2 1 ...
## $ mgmt_methylation : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 2 3 2 2 3 ...
## $ survival_days : int NA 1076 105 80 250 353 903 615 363 1096 ...
## $ egfr_amplification : Factor w/ 3 levels "","No","Yes": 2 2 3 3 3 3 2 2 2 3 ...
## $ initial_kps : int 100 100 100 70 90 100 90 80 90 90 ...
## $ age_in_years : Factor w/ 24 levels "17 yrs","26 yrs",..: 5 12 19 15 13 24 3 20 4 17 ...
summary(tumor)
## donor_id tumor_name molecular_subtype
## Min. : 10865 W1-1-2 : 1 Proneural :8
## 1st Qu.: 12125 W10-1-1: 1 Classical :7
## Median : 14736 W11-1-1: 1 :5
## Mean : 71705836 W12-1-1: 1 Mesenchymal :5
## 3rd Qu.:159855905 W13-1-1: 1 Classical, Mesenchymal:4
## Max. :292023109 W16-1-1: 1 Classical, Neural :4
## (Other):36 (Other) :9
## extent_of_resection surgery mgmt_methylation survival_days
## Complete :34 primary :39 : 2 Min. : 62.0
## Sub-total: 8 recurrent: 3 No :23 1st Qu.: 257.5
## Yes:17 Median : 439.5
## Mean : 495.2
## 3rd Qu.: 664.5
## Max. :1293.0
## NA's :10
## egfr_amplification initial_kps age_in_years
## :11 Min. : 60.00 64 yrs : 5
## No :16 1st Qu.: 80.00 52 yrs : 3
## Yes:15 Median : 90.00 57 yrs : 3
## Mean : 87.86 59 yrs : 3
## 3rd Qu.:100.00 61 yrs : 3
## Max. :100.00 36 yrs : 2
## (Other):23
Here we have a 42x10 database. It is rather a small sample, but maybe we can make a good classification even though. But first of all, we have to do something with the age_in_years variable. We have to convert it into a numeric variable.
We have observed some problems with the original database. There are too many factorial strings and just two columns of integers (survival_days and initial_kps). We have to convert them into numeric columns in order to continue our analysis properly:
tumor2 <- tumor
age <- gsub(tumor$age_in_years, pattern = "yrs", replacement = "")
tumor2$age_in_years <- as.numeric(age)
tumor2$extent_of_resection <- as.numeric(tumor$extent_of_resection)
tumor2$surgery <- as.numeric(tumor$surgery)
tumor2$mgmt_methylation <- as.numeric(tumor$mgmt_methylation)
tumor2$egfr_amplification <- as.numeric(tumor$egfr_amplification)
In addition to this we have up to 10 NA values in the survival_days variable. We have to do something with this if we want to create a predictive model. We have four options here:
In first instance, let’s explore the correlation method. To this effect we have to assess whether there are significant correlations or not.
library(PerformanceAnalytics)
chart.Correlation(tumor2[,c("survival_days", "initial_kps", "age_in_years")], histogram = TRUE, pch = 19)
There is not a significant correlation pair. We cannot use this method to fill NA values. Let’s look for similarites not between columns but between rows. To do this we can use knnImputation() function from the DMwR library.
library(DMwR)
tumor3 <- knnImputation(tumor2, k = 10)
Now we have our database free of NA.
library(car)
par(mfrow = c(1, 2))
hist(tumor3$survival_days, prob = TRUE, main = "Survival days")
lines(density(tumor3$survival_days, na.rm = TRUE), col = "red")
rug(jitter(tumor3$survival_days))
qqPlot(tumor3$survival_days, main = "Survival days QQ-plot")
## [1] 14 10
The distribution doesn’t follow a normal distribution. This could be due to the small size of the sample, so we have to be cautious. Moreover, the QQ-plot represents some outliers.
par(mfrow = c(1, 3))
boxplot(tumor3$survival_days, main = "Survival days boxplot")
rug(jitter(tumor3$survival_days), side = 2)
abline(h = mean(tumor3$survival_days, na.rm = TRUE), col = "blue", lty = 2)
boxplot(tumor3$age_in_years, main = "Age boxplot")
rug(jitter(tumor3$age_in_years), side = 2)
abline(h = mean(tumor3$age_in_years, na.rm = TRUE), col = "blue", lty = 2)
boxplot(tumor3$initial_kps, main = "Initial KPS boxplot")
rug(jitter(tumor3$initial_kps), side = 2)
abline(h = mean(tumor3$initial_kps, na.rm = TRUE), col = "blue", lty = 2)
Just with these three boxplot we can make a profile of the standard glioblastoma: an adult between 50 - 65 years old, with no remarkable problem with his daily life and who has a survival expectancy between 300 - 700 days, with some unusual cases of long survival rate.
The objective of our study is to make a classification of the different patients’ clusters. The models we will make are: * Hierarchical Classification (HC) * Non-Hierarchical Classification (NHC) or K-means clustering.
First up, we need to standarize the data to compare between scales.
tumor.norm <- decostand(tumor3[,4:10], na.rm = TRUE, "hellinger")
tumor.ch <- vegdist(tumor.norm, "euc")
There are few options to conglomerate the data, so let’s take a look one by one.
tumor.ch.single <- hclust(tumor.ch, method = "single")
tumor.ch.complete <- hclust(tumor.ch, method = "complete")
tumor.ch.upgma <- hclust(tumor.ch, method = "average")
tumor.ch.cent <- hclust(tumor.ch, method = "centroid")
tumor.ch.ward <- hclust(tumor.ch, method = "ward.D")
tumor.ch.sqrt <- sqrt(tumor.ch.ward$height)
tumor.ch.single.coph <- cophenetic(tumor.ch.single)
cor(tumor.ch, tumor.ch.single.coph)
## [1] 0.8452938
tumor.ch.complete.coph <- cophenetic(tumor.ch.complete)
cor(tumor.ch, tumor.ch.complete.coph)
## [1] 0.625307
tumor.ch.upgma.coph <- cophenetic(tumor.ch.upgma)
cor(tumor.ch, tumor.ch.upgma.coph)
## [1] 0.8258267
tumor.ch.cent.coph <- cophenetic(tumor.ch.cent)
cor(tumor.ch, tumor.ch.cent.coph)
## [1] 0.8225013
tumor.ch.ward.coph <- cophenetic(tumor.ch.ward)
cor(tumor.ch, tumor.ch.ward.coph)
## [1] 0.5974864
par(mfrow=c(2, 3))
plot(tumor.ch, tumor.ch.single.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Single linkage", paste("Cophenetic correlation =",
round(cor(tumor.ch, tumor.ch.single.coph), 3))))
abline(0,1); lines(lowess(tumor.ch, tumor.ch.single.coph), col = "red")
plot(tumor.ch, tumor.ch.complete.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Complete linkage", paste("Cophenetic correlation =",
round(cor(tumor.ch, tumor.ch.complete.coph), 3))))
abline(0,1); lines(lowess(tumor.ch, tumor.ch.complete.coph), col = "red")
plot(tumor.ch, tumor.ch.upgma.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("UPGMA linkage", paste("Cophenetic correlation =",
round(cor(tumor.ch, tumor.ch.upgma.coph), 3))))
abline(0,1); lines(lowess(tumor.ch, tumor.ch.upgma.coph), col = "red")
plot(tumor.ch, tumor.ch.cent.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Centroid linkage", paste("Cophenetic correlation =",
round(cor(tumor.ch, tumor.ch.cent.coph), 3))))
abline(0,1); lines(lowess(tumor.ch, tumor.ch.cent.coph), col = "red")
plot(tumor.ch, tumor.ch.ward.coph, xlab = "Chord distance",
ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
main = c("Ward linkage", paste("Cophenetic correlation =",
round(cor(tumor.ch, tumor.ch.ward.coph), 3))))
abline(0,1); lines(lowess(tumor.ch, tumor.ch.ward.coph), col = "red")
Giving its cophenetic distance parameter, the best method is UPGMA. Also we could use the Gower distance method to check which is the best model.
plot(tumor.ch.upgma)
Now we now which is the best conglomerate method, we need to get the best number of clusters. To do so, we will create a for loop with the different options we have and we will representate it in a silhoutte plot.
asw <- numeric(nrow(tumor3))
for (k in 2:(nrow(tumor3)-1)) {
sil <- silhouette(cutree(tumor.ch.upgma, k = k), tumor.ch)
asw[k] <- summary(sil)$avg.width
}
k.best <- which.max(asw)
plot(1:nrow(tumor3), asw, type="h",
main="Silhouette-optimal number of clusters, Ward",
xlab="k (number of groups)", ylab="Average silhouette width")
axis(1, k.best, paste("optimum",k.best,sep="\n"), col="red", font=2,
col.axis="red")
points(k.best, max(asw), pch=16, col="red", cex=1.5)
The optimal number of clusters is k = 2.
Let’s try our final plot for k = 2 and k = 3 and see the differences between them.
par(mfrow = c(1, 2))
k <- 2
cutg <- cutree(tumor.ch.upgma, k=k)
sil <- silhouette(cutg, tumor.ch)
rownames(sil) <- row.names(tumor3)
plot(sil, main="Silhouette plot - Chord - Ward",
cex.names=0.8, col=2:(k+1), nmax=100)
k <- 3
cutg <- cutree(tumor.ch.upgma, k=k)
sil <- silhouette(cutg, tumor.ch)
rownames(sil) <- row.names(tumor3)
plot(sil, main="Silhouette plot - Chord - Ward",
cex.names=0.8, col=2:(k+1), nmax=100)
K = 3 doesn’t work as bad, and just one value was misclassified. We are going to carry on doing our analysis with k = 3 since the results can be more self-explanatory and more adjusted to the theory.
hcoplot <- function(tree, diss, k,
title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
require(gclus)
gr <- cutree(tree, k = k)
tor <- reorder.hclust(tree, diss)
plot(tor, hang = -1, xlab = paste(length(gr),"sites"), sub = paste(k, "clusters"),
main = title)
so <- gr[tor$order]
gro <- numeric(k)
for (i in 1:k)
{
gro[i] <- so[1]
if (i<k) so <- so[so != gro[i]]
}
rect.hclust(tor, k = k, border = gro + 1, cluster = gr)
legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}
hcoplot(tumor.ch.upgma, tumor.ch, k = 3)
We will start creating our model with k = 3 just like the previous HC model.
set.seed(1)
(tumor.kmeans <- kmeans(tumor.norm, centers = 3, nstart = 100))
## K-means clustering with 3 clusters of sizes 18, 6, 18
##
## Cluster means:
## extent_of_resection surgery mgmt_methylation survival_days
## 1 0.03788469 0.03480502 0.05284943 0.9054084
## 2 0.06744738 0.06731479 0.08878770 0.6278674
## 3 0.04911728 0.04808323 0.07153485 0.8193016
## egfr_amplification initial_kps age_in_years
## 1 0.05035501 0.3200954 0.2550330
## 2 0.08886701 0.5792539 0.4868995
## 3 0.06319629 0.4369001 0.3465407
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 2 2 3 3 1 1 3 1 1 1 3 1 2 3 3 1 1 1 3 3 3 3 1
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
## 3 1 2 1 3 3 1 1 3 3 3 2 3 1 3 1 2
##
## Within cluster sum of squares by cluster:
## [1] 0.08563442 0.04989224 0.07086946
## (between_SS / total_SS = 82.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
tumor.KM.cascade <- cascadeKM(tumor.norm, inf.gr = 2, sup.gr = 10, iter = 100,
criterion = "ssi")
plot(tumor.KM.cascade, sortg = TRUE)
We can explain up to a 82.2% of the patients’ variance. At this point we must ask ourselves if we can improve our model changing the number of clusters. Luckily we can observe this writing a for loop for the different possible number of clusters.
asw <- numeric(nrow(tumor3))
for (k in 2:(nrow(tumor3)-1))
asw[k] <- pam(tumor.ch, k, diss=TRUE)$silinfo$avg.width
k.best <- which.max(asw)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n",
"with an average silhouette width of", max(asw), "\n")
## Silhouette-optimal number of clusters k = 2
## with an average silhouette width of 0.4778439
plot(1:nrow(tumor3), asw, type="h", main="Choice of the number of clusters",
xlab="k (number of clusters)", ylab="Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep="\n"), col="red", font=2,
col.axis="red")
points(k.best, max(asw), pch=16, col="red", cex=1.5)
As in the previous analysis, the ideal number of clusters suggested by the silhouette plot is somewhere between 2 and 4.
clusplot(tumor.norm, tumor.kmeans$cluster, color= TRUE, shade = TRUE, labels = 2,
lines = 0)
There are three well defined groups. This model can explain up to a 88% of the variability, so we have a satisfying model, even better than the previous NHC model.
There are a lot of analysis we can make through calculating the differences between clusters. As an example, let’s explore the difference in survival days between the three patients groups:
group1 <- tumor3[tumor.kmeans$cluster == 1, ]
group2 <- tumor3[tumor.kmeans$cluster == 2, ]
group3 <- tumor3[tumor.kmeans$cluster == 3, ]
summary(group1[,6:10]); summary(group2[,6:10]); summary(group3[,6:10])
## mgmt_methylation survival_days egfr_amplification initial_kps
## Min. :1.000 Min. : 442.0 Min. :1.000 Min. : 60.00
## 1st Qu.:2.000 1st Qu.: 540.5 1st Qu.:2.000 1st Qu.: 80.00
## Median :2.000 Median : 624.0 Median :2.000 Median : 90.00
## Mean :2.389 Mean : 730.3 Mean :2.167 Mean : 85.56
## 3rd Qu.:3.000 3rd Qu.: 897.2 3rd Qu.:3.000 3rd Qu.: 90.00
## Max. :3.000 Max. :1293.0 Max. :3.000 Max. :100.00
## age_in_years
## Min. :17.00
## 1st Qu.:50.25
## Median :57.00
## Mean :54.72
## 3rd Qu.:63.75
## Max. :72.00
## mgmt_methylation survival_days egfr_amplification initial_kps
## Min. :2 Min. : 62.0 Min. :1.000 Min. : 70.00
## 1st Qu.:2 1st Qu.: 74.0 1st Qu.:1.250 1st Qu.: 75.00
## Median :2 Median : 92.5 Median :2.500 Median : 90.00
## Mean :2 Mean :108.0 Mean :2.167 Mean : 86.67
## 3rd Qu.:2 3rd Qu.:135.0 3rd Qu.:3.000 3rd Qu.: 97.50
## Max. :2 Max. :184.0 Max. :3.000 Max. :100.00
## age_in_years
## Min. :50.00
## 1st Qu.:58.00
## Median :62.50
## Mean :60.33
## 3rd Qu.:64.00
## Max. :66.00
## mgmt_methylation survival_days egfr_amplification initial_kps
## Min. :1.000 Min. :197.0 Min. :1 Min. : 70.00
## 1st Qu.:2.000 1st Qu.:291.4 1st Qu.:1 1st Qu.: 90.00
## Median :2.500 Median :316.8 Median :2 Median : 90.00
## Mean :2.444 Mean :323.3 Mean :2 Mean : 90.56
## 3rd Qu.:3.000 3rd Qu.:360.5 3rd Qu.:3 3rd Qu.:100.00
## Max. :3.000 Max. :437.0 Max. :3 Max. :100.00
## age_in_years
## Min. :26.00
## 1st Qu.:52.50
## Median :59.50
## Mean :57.56
## 3rd Qu.:64.00
## Max. :76.00
Based on the results of this evaluation we can make the following conclusions: