Glioblastoma is the most deadly type of brain cancer. Usually affects 50 and 70 year old population, and it is one of the fastest growing cancer since it affects the brain glia cells (often fast-growing and invasive). This time we will try to create a clustering and hierarchichal classification of patients.

glio

1. Collecting data

We will use the public database from Columbia University:

molecular_subtype	extent_of_resection	surgery	mgmt_methylation	survival_days	egfr_amplification	initial_kps	age_in_years
Proneural	Complete	primary	Yes	NA	No	100	44 yrs
Classical, Mesenchymal	Complete	primary	Yes	1076	No	100	57 yrs
Classical	Complete	primary	No	105	Yes	100	66 yrs
Classical	Sub-total	primary	No	80	Yes	70	61 yrs
Mesenchymal	Complete	primary	No	250	Yes	90	59 yrs
Neural, Proneural	Complete	primary	No	353	Yes	100	76 yrs
	Sub-total	primary	Yes	903	No	90	36 yrs
Proneural	Complete	primary	No	615	No	80	67 yrs
Mesenchymal	Complete	recurrent	No	363	No	90	43 yrs
Classical, Neural	Complete	primary	Yes	1096	Yes	90	64 yrs

dim(tumor)

## [1] 42 10

str(tumor)

## 'data.frame':    42 obs. of  10 variables:
##  $ donor_id           : int  12111 12112 10865 12165 12877 12996 159992499 159994896 277805907 10926 ...
##  $ tumor_name         : Factor w/ 42 levels "W1-1-2","W10-1-1",..: 2 3 1 4 5 6 7 8 10 9 ...
##  $ molecular_subtype  : Factor w/ 9 levels "","Classical",..: 9 3 2 2 5 8 1 9 5 4 ...
##  $ extent_of_resection: Factor w/ 2 levels "Complete","Sub-total": 1 1 1 2 1 1 2 1 1 1 ...
##  $ surgery            : Factor w/ 2 levels "primary","recurrent": 1 1 1 1 1 1 1 1 2 1 ...
##  $ mgmt_methylation   : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 2 3 2 2 3 ...
##  $ survival_days      : int  NA 1076 105 80 250 353 903 615 363 1096 ...
##  $ egfr_amplification : Factor w/ 3 levels "","No","Yes": 2 2 3 3 3 3 2 2 2 3 ...
##  $ initial_kps        : int  100 100 100 70 90 100 90 80 90 90 ...
##  $ age_in_years       : Factor w/ 24 levels "17 yrs","26 yrs",..: 5 12 19 15 13 24 3 20 4 17 ...

summary(tumor)

##     donor_id           tumor_name              molecular_subtype
##  Min.   :    10865   W1-1-2 : 1   Proneural             :8      
##  1st Qu.:    12125   W10-1-1: 1   Classical             :7      
##  Median :    14736   W11-1-1: 1                         :5      
##  Mean   : 71705836   W12-1-1: 1   Mesenchymal           :5      
##  3rd Qu.:159855905   W13-1-1: 1   Classical, Mesenchymal:4      
##  Max.   :292023109   W16-1-1: 1   Classical, Neural     :4      
##                      (Other):36   (Other)               :9      
##  extent_of_resection      surgery   mgmt_methylation survival_days   
##  Complete :34        primary  :39      : 2           Min.   :  62.0  
##  Sub-total: 8        recurrent: 3   No :23           1st Qu.: 257.5  
##                                     Yes:17           Median : 439.5  
##                                                      Mean   : 495.2  
##                                                      3rd Qu.: 664.5  
##                                                      Max.   :1293.0  
##                                                      NA's   :10      
##  egfr_amplification  initial_kps      age_in_years
##     :11             Min.   : 60.00   64 yrs : 5   
##  No :16             1st Qu.: 80.00   52 yrs : 3   
##  Yes:15             Median : 90.00   57 yrs : 3   
##                     Mean   : 87.86   59 yrs : 3   
##                     3rd Qu.:100.00   61 yrs : 3   
##                     Max.   :100.00   36 yrs : 2   
##                                      (Other):23

Here we have a 42x10 database. It is rather a small sample, but maybe we can make a good classification even though. But first of all, we have to do something with the age_in_years variable. We have to convert it into a numeric variable.

2. Preprocessing

We have observed some problems with the original database. There are too many factorial strings and just two columns of integers (survival_days and initial_kps). We have to convert them into numeric columns in order to continue our analysis properly:

tumor2 <- tumor
age <- gsub(tumor$age_in_years, pattern = "yrs", replacement = "")
tumor2$age_in_years <- as.numeric(age)

tumor2$extent_of_resection <- as.numeric(tumor$extent_of_resection)
tumor2$surgery <- as.numeric(tumor$surgery)
tumor2$mgmt_methylation <- as.numeric(tumor$mgmt_methylation)
tumor2$egfr_amplification <- as.numeric(tumor$egfr_amplification)

In addition to this we have up to 10 NA values in the survival_days variable. We have to do something with this if we want to create a predictive model. We have four options here:

Remove the values.
Fill it with correlation values.
Fill it exploring the similarity between variables.
Use tools to handle those values.

In first instance, let’s explore the correlation method. To this effect we have to assess whether there are significant correlations or not.

library(PerformanceAnalytics)
chart.Correlation(tumor2[,c("survival_days", "initial_kps", "age_in_years")], histogram = TRUE, pch = 19)

There is not a significant correlation pair. We cannot use this method to fill NA values. Let’s look for similarites not between columns but between rows. To do this we can use knnImputation() function from the DMwR library.

library(DMwR)
tumor3 <- knnImputation(tumor2, k = 10)

Now we have our database free of NA.

3. Exploratory analysis

library(car)
par(mfrow = c(1, 2))
hist(tumor3$survival_days, prob = TRUE, main = "Survival days")
lines(density(tumor3$survival_days, na.rm = TRUE), col = "red")
rug(jitter(tumor3$survival_days))
qqPlot(tumor3$survival_days, main = "Survival days QQ-plot")

## [1] 14 10

The distribution doesn’t follow a normal distribution. This could be due to the small size of the sample, so we have to be cautious. Moreover, the QQ-plot represents some outliers.

par(mfrow = c(1, 3))

boxplot(tumor3$survival_days, main = "Survival days boxplot")
rug(jitter(tumor3$survival_days), side = 2)
abline(h = mean(tumor3$survival_days, na.rm = TRUE), col = "blue", lty = 2)

boxplot(tumor3$age_in_years, main = "Age boxplot")
rug(jitter(tumor3$age_in_years), side = 2)
abline(h = mean(tumor3$age_in_years, na.rm = TRUE), col = "blue", lty = 2)

boxplot(tumor3$initial_kps, main = "Initial KPS boxplot")
rug(jitter(tumor3$initial_kps), side = 2)
abline(h = mean(tumor3$initial_kps, na.rm = TRUE), col = "blue", lty = 2)

Just with these three boxplot we can make a profile of the standard glioblastoma: an adult between 50 - 65 years old, with no remarkable problem with his daily life and who has a survival expectancy between 300 - 700 days, with some unusual cases of long survival rate.

4. Model creation

The objective of our study is to make a classification of the different patients’ clusters. The models we will make are: * Hierarchical Classification (HC) * Non-Hierarchical Classification (NHC) or K-means clustering.

4.1. Hierarchical classification (HC)

4.1.1. Standarization

First up, we need to standarize the data to compare between scales.

tumor.norm <- decostand(tumor3[,4:10], na.rm = TRUE, "hellinger")
tumor.ch <- vegdist(tumor.norm, "euc")

4.1.2. Conglomerate selection

There are few options to conglomerate the data, so let’s take a look one by one.

tumor.ch.single <- hclust(tumor.ch, method = "single")
tumor.ch.complete <- hclust(tumor.ch, method = "complete")
tumor.ch.upgma <- hclust(tumor.ch, method = "average")
tumor.ch.cent <- hclust(tumor.ch, method = "centroid")
tumor.ch.ward <- hclust(tumor.ch, method = "ward.D")
tumor.ch.sqrt <- sqrt(tumor.ch.ward$height)

tumor.ch.single.coph <- cophenetic(tumor.ch.single)
cor(tumor.ch, tumor.ch.single.coph)

## [1] 0.8452938

tumor.ch.complete.coph <- cophenetic(tumor.ch.complete)
cor(tumor.ch, tumor.ch.complete.coph)

## [1] 0.625307

tumor.ch.upgma.coph <- cophenetic(tumor.ch.upgma)
cor(tumor.ch, tumor.ch.upgma.coph)

## [1] 0.8258267

tumor.ch.cent.coph <- cophenetic(tumor.ch.cent)
cor(tumor.ch, tumor.ch.cent.coph)

## [1] 0.8225013

tumor.ch.ward.coph <- cophenetic(tumor.ch.ward)
cor(tumor.ch, tumor.ch.ward.coph)

## [1] 0.5974864

par(mfrow=c(2, 3))

plot(tumor.ch, tumor.ch.single.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Single linkage", paste("Cophenetic correlation =",
                                    round(cor(tumor.ch, tumor.ch.single.coph), 3))))
abline(0,1);  lines(lowess(tumor.ch, tumor.ch.single.coph), col = "red")

plot(tumor.ch, tumor.ch.complete.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Complete linkage", paste("Cophenetic correlation =",
                                      round(cor(tumor.ch, tumor.ch.complete.coph), 3))))
abline(0,1);  lines(lowess(tumor.ch, tumor.ch.complete.coph), col = "red")

plot(tumor.ch, tumor.ch.upgma.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("UPGMA linkage", paste("Cophenetic correlation =",
                                      round(cor(tumor.ch, tumor.ch.upgma.coph), 3))))
abline(0,1);  lines(lowess(tumor.ch, tumor.ch.upgma.coph), col = "red")

plot(tumor.ch, tumor.ch.cent.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Centroid linkage", paste("Cophenetic correlation =",
                                      round(cor(tumor.ch, tumor.ch.cent.coph), 3))))
abline(0,1);  lines(lowess(tumor.ch, tumor.ch.cent.coph), col = "red")

plot(tumor.ch, tumor.ch.ward.coph, xlab = "Chord distance", 
     ylab = "Cophenetic distance", asp = 1, xlim = c(0, sqrt(2)), ylim = c(0, sqrt(2)),
     main = c("Ward linkage", paste("Cophenetic correlation =",
                                      round(cor(tumor.ch, tumor.ch.ward.coph), 3))))
abline(0,1);  lines(lowess(tumor.ch, tumor.ch.ward.coph), col = "red")

Giving its cophenetic distance parameter, the best method is UPGMA. Also we could use the Gower distance method to check which is the best model.

plot(tumor.ch.upgma)

4.1.3. Conglomerate number selection

Now we now which is the best conglomerate method, we need to get the best number of clusters. To do so, we will create a for loop with the different options we have and we will representate it in a silhoutte plot.

asw <- numeric(nrow(tumor3))
for (k in 2:(nrow(tumor3)-1)) {
  sil <- silhouette(cutree(tumor.ch.upgma, k = k), tumor.ch)
  asw[k] <- summary(sil)$avg.width
}
k.best <- which.max(asw)
plot(1:nrow(tumor3), asw, type="h",
     main="Silhouette-optimal number of clusters, Ward",
     xlab="k (number of groups)", ylab="Average silhouette width")
axis(1, k.best, paste("optimum",k.best,sep="\n"), col="red", font=2,
       col.axis="red")
points(k.best, max(asw), pch=16, col="red", cex=1.5)

The optimal number of clusters is k = 2.

4.1.4. Final plot

Let’s try our final plot for k = 2 and k = 3 and see the differences between them.

par(mfrow = c(1, 2))
k <- 2
cutg <- cutree(tumor.ch.upgma, k=k)
sil <- silhouette(cutg, tumor.ch)
rownames(sil) <- row.names(tumor3)
plot(sil, main="Silhouette plot - Chord - Ward",
     cex.names=0.8, col=2:(k+1), nmax=100)

k <- 3
cutg <- cutree(tumor.ch.upgma, k=k)
sil <- silhouette(cutg, tumor.ch)
rownames(sil) <- row.names(tumor3)
plot(sil, main="Silhouette plot - Chord - Ward",
     cex.names=0.8, col=2:(k+1), nmax=100)

K = 3 doesn’t work as bad, and just one value was misclassified. We are going to carry on doing our analysis with k = 3 since the results can be more self-explanatory and more adjusted to the theory.

hcoplot <- function(tree, diss, k, 
                    title = paste("Reordered dendrogram from", deparse(tree$call), sep = "\n"))
{
  require(gclus)
  gr <- cutree(tree, k = k)
  tor <- reorder.hclust(tree, diss)
  plot(tor, hang = -1, xlab = paste(length(gr),"sites"), sub = paste(k, "clusters"), 
       main = title)
  so <- gr[tor$order]
  gro <- numeric(k)
  for (i in 1:k)
  {
    gro[i] <- so[1]
    if (i<k) so <- so[so != gro[i]]
  }
  rect.hclust(tor, k = k, border = gro + 1, cluster = gr)
  legend("topright", paste("Cluster", 1:k), pch = 22, col = 2:(k + 1), bty = "n")
}
hcoplot(tumor.ch.upgma, tumor.ch, k = 3)

4.2. Non-Hierarchical Classification (NHC)

4.2.1. Model creation

We will start creating our model with k = 3 just like the previous HC model.

set.seed(1)
(tumor.kmeans <- kmeans(tumor.norm, centers = 3, nstart = 100))

## K-means clustering with 3 clusters of sizes 18, 6, 18
## 
## Cluster means:
##   extent_of_resection    surgery mgmt_methylation survival_days
## 1          0.03788469 0.03480502       0.05284943     0.9054084
## 2          0.06744738 0.06731479       0.08878770     0.6278674
## 3          0.04911728 0.04808323       0.07153485     0.8193016
##   egfr_amplification initial_kps age_in_years
## 1         0.05035501   0.3200954    0.2550330
## 2         0.08886701   0.5792539    0.4868995
## 3         0.06319629   0.4369001    0.3465407
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  1  1  2  2  3  3  1  1  3  1  1  1  3  1  2  3  3  1  1  1  3  3  3  3  1 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
##  3  1  2  1  3  3  1  1  3  3  3  2  3  1  3  1  2 
## 
## Within cluster sum of squares by cluster:
## [1] 0.08563442 0.04989224 0.07086946
##  (between_SS / total_SS =  82.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

tumor.KM.cascade <- cascadeKM(tumor.norm, inf.gr = 2, sup.gr = 10, iter = 100,
                              criterion = "ssi")
plot(tumor.KM.cascade, sortg = TRUE)

We can explain up to a 82.2% of the patients’ variance. At this point we must ask ourselves if we can improve our model changing the number of clusters. Luckily we can observe this writing a for loop for the different possible number of clusters.

asw <- numeric(nrow(tumor3))
for (k in 2:(nrow(tumor3)-1))
  asw[k] <- pam(tumor.ch, k, diss=TRUE)$silinfo$avg.width
k.best <- which.max(asw)
cat("", "Silhouette-optimal number of clusters k =", k.best, "\n",
    "with an average silhouette width of", max(asw), "\n")

##  Silhouette-optimal number of clusters k = 2 
##  with an average silhouette width of 0.4778439

plot(1:nrow(tumor3), asw, type="h", main="Choice of the number of clusters",
     xlab="k (number of clusters)", ylab="Average silhouette width")
axis(1, k.best, paste("optimum", k.best, sep="\n"), col="red", font=2,
     col.axis="red")
points(k.best, max(asw), pch=16, col="red", cex=1.5)

As in the previous analysis, the ideal number of clusters suggested by the silhouette plot is somewhere between 2 and 4.

4.2.2. Model plot

clusplot(tumor.norm, tumor.kmeans$cluster, color= TRUE, shade = TRUE, labels = 2,
         lines = 0)

There are three well defined groups. This model can explain up to a 88% of the variability, so we have a satisfying model, even better than the previous NHC model.

4.2.3. Exploring clusters

There are a lot of analysis we can make through calculating the differences between clusters. As an example, let’s explore the difference in survival days between the three patients groups:

group1 <- tumor3[tumor.kmeans$cluster == 1, ]
group2 <- tumor3[tumor.kmeans$cluster == 2, ]
group3 <- tumor3[tumor.kmeans$cluster == 3, ]

summary(group1[,6:10]); summary(group2[,6:10]); summary(group3[,6:10])

##  mgmt_methylation survival_days    egfr_amplification  initial_kps    
##  Min.   :1.000    Min.   : 442.0   Min.   :1.000      Min.   : 60.00  
##  1st Qu.:2.000    1st Qu.: 540.5   1st Qu.:2.000      1st Qu.: 80.00  
##  Median :2.000    Median : 624.0   Median :2.000      Median : 90.00  
##  Mean   :2.389    Mean   : 730.3   Mean   :2.167      Mean   : 85.56  
##  3rd Qu.:3.000    3rd Qu.: 897.2   3rd Qu.:3.000      3rd Qu.: 90.00  
##  Max.   :3.000    Max.   :1293.0   Max.   :3.000      Max.   :100.00  
##   age_in_years  
##  Min.   :17.00  
##  1st Qu.:50.25  
##  Median :57.00  
##  Mean   :54.72  
##  3rd Qu.:63.75  
##  Max.   :72.00

##  mgmt_methylation survival_days   egfr_amplification  initial_kps    
##  Min.   :2        Min.   : 62.0   Min.   :1.000      Min.   : 70.00  
##  1st Qu.:2        1st Qu.: 74.0   1st Qu.:1.250      1st Qu.: 75.00  
##  Median :2        Median : 92.5   Median :2.500      Median : 90.00  
##  Mean   :2        Mean   :108.0   Mean   :2.167      Mean   : 86.67  
##  3rd Qu.:2        3rd Qu.:135.0   3rd Qu.:3.000      3rd Qu.: 97.50  
##  Max.   :2        Max.   :184.0   Max.   :3.000      Max.   :100.00  
##   age_in_years  
##  Min.   :50.00  
##  1st Qu.:58.00  
##  Median :62.50  
##  Mean   :60.33  
##  3rd Qu.:64.00  
##  Max.   :66.00

##  mgmt_methylation survival_days   egfr_amplification  initial_kps    
##  Min.   :1.000    Min.   :197.0   Min.   :1          Min.   : 70.00  
##  1st Qu.:2.000    1st Qu.:291.4   1st Qu.:1          1st Qu.: 90.00  
##  Median :2.500    Median :316.8   Median :2          Median : 90.00  
##  Mean   :2.444    Mean   :323.3   Mean   :2          Mean   : 90.56  
##  3rd Qu.:3.000    3rd Qu.:360.5   3rd Qu.:3          3rd Qu.:100.00  
##  Max.   :3.000    Max.   :437.0   Max.   :3          Max.   :100.00  
##   age_in_years  
##  Min.   :26.00  
##  1st Qu.:52.50  
##  Median :59.50  
##  Mean   :57.56  
##  3rd Qu.:64.00  
##  Max.   :76.00

Based on the results of this evaluation we can make the following conclusions:

Presence of MGMT is related to a long-term survival.
Age has a weak relation with long-term survival.
No relationship observed between initial KPS and glioblastoma evolution.

K-means and dendrogram clustering: Glioblastoma cancer clustering

David Valls Lanàquera