Introduction

This is an early Exploratory Data Analysis for the Personalized Medicine: Redefining Cancer Treatment challenge. I will be using ggplot2 and the tidyverse tools to study and visualise the structures in the data.

We have been challenged to automatically classify genetic mutations that contribute to cancer tumor growth (so-called “drivers”) in the presence of mutations that are don’t affect the tumors (“passengers”).

The data comes in 4 different files. Two csv files and two text files:

training/test variants: These are csv catalogues of the gene mutations together with the target value Class, which is the (manually) classified assessment of the mutation. The feature variables are Gene, the specific gene where the mutation took place, and Variation, the nature of the mutation. The test data of course doesn’t have the Class values. This is what we have to predict. These two files each are linked through an ID variable to another file each, namely:

training/test text: Those contain an extensive description of the evidence that was used (by experts) to manually label the mutation classes.

train <- read_csv('C:/Users/UDAY/Desktop/workshop/breast cancer/training_variants')
## Parsed with column specification:
## cols(
##   ID = col_integer(),
##   Gene = col_character(),
##   Variation = col_character(),
##   Class = col_integer()
## )
test  <- read_csv('C:/Users/UDAY/Desktop/workshop/breast cancer/test_variants')
## Parsed with column specification:
## cols(
##   ID = col_integer(),
##   Gene = col_character(),
##   Variation = col_character()
## )
train_txt_dump <- tibble(text = read_lines('C:/Users/UDAY/Desktop/workshop/breast cancer/training_text', skip = 1))
train_txt <- train_txt_dump %>%
  separate(text, into = c("ID", "txt"), sep = "\\|\\|")
train_txt <- train_txt %>%
  mutate(ID = as.integer(ID))

test_txt_dump <- tibble(text = read_lines('C:/Users/UDAY/Desktop/workshop/breast cancer/test_text', skip = 1))
test_txt <- test_txt_dump %>%
  separate(text, into = c("ID", "txt"), sep = "\\|\\|")
test_txt <- test_txt %>%
  mutate(ID = as.integer(ID))

First table overviews of the data:

##        ID            Gene                     Variation    Class  
##  Min.   :   0   BRCA1  : 264   Truncating Mutations:  93   1:568  
##  1st Qu.: 830   TP53   : 163   Deletion            :  74   2:452  
##  Median :1660   EGFR   : 141   Amplification       :  71   3: 89  
##  Mean   :1660   PTEN   : 126   Fusions             :  34   4:686  
##  3rd Qu.:2490   BRCA2  : 125   Overexpression      :   6   5:242  
##  Max.   :3320   KIT    :  99   G12V                :   4   6:275  
##                 BRAF   :  93   E17K                :   3   7:953  
##                 ALK    :  69   Q61H                :   3   8: 19  
##                 (Other):2241   (Other)             :3033   9: 37

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

glimpse(train)
## Observations: 3,321
## Variables: 4
## $ ID        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Gene      <fct> FAM58A, CBL, CBL, CBL, CBL, CBL, CBL, CBL, CBL, CBL,...
## $ Variation <fct> Truncating Mutations, W802*, Q249E, N454D, L399V, V3...
## $ Class     <fct> 1, 2, 2, 3, 4, 4, 5, 1, 4, 4, 4, 4, 4, 4, 5, 4, 1, 4...
nrow(train)
## [1] 3321
nrow(test)
## [1] 5668
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
train %>%
  group_by(Gene) %>%
  summarise(ct = n()) %>%
  arrange(desc(ct))
## # A tibble: 264 x 2
##    Gene      ct
##    <fct>  <int>
##  1 BRCA1    264
##  2 TP53     163
##  3 EGFR     141
##  4 PTEN     126
##  5 BRCA2    125
##  6 KIT       99
##  7 BRAF      93
##  8 ALK       69
##  9 ERBB2     69
## 10 PDGFRA    60
## # ... with 254 more rows
test %>%
  group_by(Gene) %>%
  summarise(ct = n()) %>%
  arrange(desc(ct))
## # A tibble: 1,397 x 2
##    Gene     ct
##    <fct> <int>
##  1 F8      134
##  2 CFTR     57
##  3 F9       54
##  4 G6PD     46
##  5 GBA      39
##  6 AR       38
##  7 PAH      38
##  8 CASR     37
##  9 ARSA     30
## 10 BRCA1    29
## # ... with 1,387 more rows
train %>%
  group_by(Variation) %>%
  summarise(ct = n()) %>%
  arrange(desc(ct))
## # A tibble: 2,996 x 2
##    Variation               ct
##    <fct>                <int>
##  1 Truncating Mutations    93
##  2 Deletion                74
##  3 Amplification           71
##  4 Fusions                 34
##  5 Overexpression           6
##  6 G12V                     4
##  7 E17K                     3
##  8 Q61H                     3
##  9 Q61L                     3
## 10 Q61R                     3
## # ... with 2,986 more rows
test %>%
  group_by(Variation) %>%
  summarise(ct = n()) %>%
  arrange(desc(ct))
## # A tibble: 5,628 x 2
##    Variation               ct
##    <fct>                <int>
##  1 Truncating Mutations    18
##  2 Deletion                14
##  3 Amplification            8
##  4 Fusions                  3
##  5 G44D                     2
##  6 A101V                    1
##  7 A1020P                   1
##  8 A1028V                   1
##  9 A1035V                   1
## 10 A1038V                   1
## # ... with 5,618 more rows

There are 3321 different IDs in the training set containing 264 different Gene expressions with 2996 different Variations. There are 9 different Classes indicated by integer levels.

The Gene and Variation features contain character strings of various lengths.

There is 70% more test data than train data. The data description tells us that “Some of the test data is machine-generated to prevent hand labeling.”, which should explain this otherwise curious imbalance.

There are no missing values in the variants data.

The most frequent Genes in the train vs test data are complete different. In addition, the test data seems to contain significantly more different Genes and fewer high-frequency Genes than the train data. To some extent, this might be an effect of the added machine-generate entries in the test data (by adding many different random levels). Thereby, the difference in frequency might mirror the true fraction of effective test data over train data.

In contrast, the most frequent Variations in train vs test are largely identical; although, again, the corresponding frequencies are lower in the test data (by a factor of 5 - 10).

Individual feature visualisations

This is the frequency distribution of the most frequent Gene values:

top_gene <- train %>%
  group_by(Gene) %>%
  summarise(ct = n()) %>%
  filter(ct > 40)

top_gene %>%
  ggplot(aes(reorder(Gene, -ct, FUN = min), ct)) +
  geom_point(size = 4) +
  labs(x = "Gene", y = "Frequency") +
  coord_flip()

top_gene_test <- test %>%
  group_by(Gene) %>%
  summarise(ct = n()) %>%
  filter(ct > 40)

top_gene_test %>%
  ggplot(aes(reorder(Gene, -ct, FUN = min), ct)) +
  geom_point(size = 4) +
  labs(x = "Gene", y = "Frequency") +
  coord_flip()

We find:

A relatively small group of Gene levels make up a sizeable part of the feature values in both train and test data.

The test data has fewer high-frequency Genes.

foo <- train %>% mutate(set = factor("train")) %>% select(-Class, -ID)
bar <- test %>% mutate(set = factor("test")) %>% select(-ID)

foo <- full_join(foo, bar)
## Joining, by = c("Gene", "Variation", "set")
## Warning: Column `Gene` joining factors with different levels, coercing to
## character vector
## Warning: Column `Variation` joining factors with different levels, coercing
## to character vector
## Warning: Column `set` joining factors with different levels, coercing to
## character vector
foo %>%
  group_by(Variation, set) %>%
  summarise(ct = n()) %>%
  filter(ct > 3) %>%
  ggplot(aes(reorder(Variation, -ct, FUN = median), ct, colour = set)) +
  geom_point(size = 4) +
  coord_cartesian(ylim = c(0, 100)) +
  labs(x = "Variation", y = "Frequency")

Here we see how the Class target is distributed in the train data:

train %>%
  ggplot(aes(Class)) +
  geom_bar()

We find:

Class levels 3, 8, and 9 are notably under-represented

Levels 5 and 6 are of comparable, medium-low frequency

Levels 1, 2, and 4 are of comparable, medium-high frequency

Level 7 is clearly the most frequent one

Feature interactions

Now we want to examine how the features interact with each other and with the target Class variable.

Gene vs Class

First, we will look at the frequency distribution of the overall most frequent Genes for the different Classes. Note the logarithmic frequency scale.

train %>%
  filter(Gene %in% str_c(top_gene$Gene)) %>%
  ggplot(aes(Gene)) +
  geom_bar() +
  scale_y_log10() +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=7)) +
  facet_wrap(~ Class)

We see immediately that there are significant differences:

Some Genes, like “PTEN”, are predominatly present in a single Class (here: 4).

Other Genes, like “TP53”, are mainly shared between 2 classes (here: 1 and 4).

Classes 8 and 9 contain none of the most frequent Genes.

Here’s what it looks like for the Classes sorted by Genes (again log counts):

train %>%
  filter(Gene %in% str_c(top_gene$Gene)) %>%
  ggplot(aes(Class)) +
  geom_bar() +
  scale_y_log10() +
  facet_wrap(~ Gene)

This representation underlines our findings about the similar/dominating Genes in different Classes.

Gene vs Variation

Next, we are somewhat repurposing a count plot to visualise how the Variations are distributed for the most frequent Genes. Since there are so many different variations we drop the y-axis labels and merely illustrate how many Gene - Variation combinations exist in the data.

First the training data:

foo <- train %>%
  filter(Gene %in% str_c(top_gene$Gene)) %>%
  group_by(Gene, Variation) %>%
  summarise(ct = n())

y_labels <- str_sub(foo$Variation, start = 1, end = 5)
  
foo %>%
  ggplot(aes(reorder(Gene, ct, FUN = median), reorder(Variation, ct, FUN = median))) +
  geom_count() +
  labs(x = "Gene", y = "Variation") +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=7),
        axis.ticks = element_blank(), axis.text.y = element_blank(),
        legend.position = "none")

Then the test data:

foo <- test %>%
  filter(Gene %in% str_c(top_gene$Gene)) %>%
  group_by(Gene, Variation) %>%
  summarise(ct = n())

y_labels <- str_sub(foo$Variation, start = 1, end = 5)
  
foo %>%
  ggplot(aes(reorder(Gene, ct, FUN = median), reorder(Variation, ct, FUN = median))) +
  geom_count() +
  labs(x = "Gene", y = "Variation") +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=7),
        axis.ticks = element_blank(), axis.text.y = element_blank(),
        legend.position = "none")

The second kind of data files contain a whole lot of text from what looks like scientific papers or proceedings. Here is the beginning of the first entry:

str_sub(train_txt$txt[1], start = 1, end = 1e3)
## [1] "Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cyclin M silencing phenocopies CDK10"

Text length - txt_len

train_txt <- train_txt %>%
  mutate(txt_len = str_length(txt),
         set = "train")

test_txt <- test_txt %>%
  mutate(txt_len = str_length(txt),
         set = "test")

combine_txt <- full_join(train_txt,test_txt)
## Joining, by = c("ID", "txt", "txt_len", "set")

First, here is the overall distribution of the text entry lengths in train vs test:

combine_txt %>%
  ggplot(aes(txt_len, fill = set)) +
#  geom_density(alpha = 0.5, bw = 5e3) +
  geom_histogram(bins = 50) +
  labs(x = "Length of text entry")

Now, let’s see whether this distribution changes for the different target Classes. First, a facet wrap comparison:

foo <- train_txt %>%
  select(ID, txt_len)
bar <- train %>%
  select(ID, Class)

full_join(foo, bar, by = "ID") %>%
  ggplot(aes(txt_len)) +
  geom_density(fill = "red", bw = 5e3) +
  labs(x = "Length of text entry") +
  facet_wrap(~ Class)

Then an overlay of empirical cumulative density functions:

foo <- train_txt %>%
  select(ID, txt_len)
bar <- train %>%
  select(ID, Class)

full_join(foo, bar, by = "ID") %>%
  ggplot(aes(txt_len)) +
  stat_ecdf(geom = "step") +
  stat_ecdf(aes(txt_len, color = Class), geom = "step") +
  labs(x = "Length of text entry")

And the median lengths for each class:

foo <- train_txt %>%
  select(ID, txt_len)
bar <- train %>%
  select(ID, Class)

full_join(foo, bar, by = "ID") %>%
  group_by(Class) %>%
  summarise(l_med = median(txt_len))
## # A tibble: 9 x 2
##   Class  l_med
##   <fct>  <dbl>
## 1 1     49582.
## 2 2     45434.
## 3 3     36901.
## 4 4     42790.
## 5 5     42863.
## 6 6     46289.
## 7 7     54675.
## 8 8     76146.
## 9 9     73433.

We find:

There appear to be significant differences in the shape and median of the test length distributions. Classes 8 and 9 require on average more text, whereas Class 3 has the shortest/fewest papers associated with it.

For what it’s worth, it is tempting to speculate that the apparent multiple peaks in the text length distributions of the individual Classes could correspond to the number of papers that make up the clinical evidence.

Missing text values

combine_txt %>%
  filter(txt_len < 100)
## # A tibble: 6 x 4
##      ID txt     txt_len set  
##   <int> <chr>     <int> <chr>
## 1  1109 "null "       5 train
## 2  1277 "null "       5 train
## 3  1407 "null "       5 train
## 4  1639 "null "       5 train
## 5  2755 "null "       5 train
## 6  1623 "null "       5 test

We choose the two words “pathogenic” and “benign” that are used in the naming of the 5 categories in this overview paper. Here we extract their frequency of occurence per observation:

train_txt <- train_txt %>%
  mutate(f_pathogenic = str_count(txt, "pathogenic"),
         f_benign = str_count(txt, "benign")
         )

Those are the frequency distributions of the word “pathogenic” for our 9 classes (note the logarithmic y-axes):

foo <- train_txt %>%
  select(ID, f_benign, f_pathogenic)
bar <- train %>%
  select(ID, Class)

full_join(foo, bar, by = "ID") %>%
  ggplot(aes(f_pathogenic)) +
  geom_bar() +
  scale_y_log10() +
#  scale_x_log10() +
  facet_wrap(~ Class)

And here we plot the ratio of the mean occurence per class of the word “pathogenic” over the mean occurence of the word “benign”:

foo <- train_txt %>%
  select(ID, f_benign, f_pathogenic)
bar <- train %>%
  select(ID, Class)

full_join(foo, bar, by = "ID") %>%
  group_by(Class) %>%
  summarise(mean_benign = mean(f_benign),
            mean_pathogenic = mean(f_pathogenic),
            path_ben = mean(f_pathogenic)/mean(f_benign)) %>%
  ggplot(aes(reorder(Class, -path_ben, FUN = max), path_ben)) +
  geom_point(colour = "red", size = 3) +
  labs(x = "Class", y = "# occurences 'pathogenic' / # occurences 'benign'")

We find:

The facet plot shows that the word “pathogenic” is clearly more frequent in certain Classes such as 1, 4, or 5

The ratio plot confirms this impression and suggests two distinct groups of Classes: 2, 7, 8, 9 vs 1, 3, 4. The latter have on average a higher ratio of mentions of “pathogenic” over “benign” than the former. In addition, Classes 5 and 6 have an even higher ratio of “pathogenic” over “benign”.

TExt Analysis

The tidytext package contains a dictionary of stop words, like “and” or “next”, which we can remove from our tidy text data. In addition, we will define our own selection of stop words based on the typical structuring language of scientific papers. We also remove tokens that are only numbers or symbols.

t1 <- train_txt %>% select(ID, txt) %>% unnest_tokens(word, txt)
head(t1)
## # A tibble: 6 x 2
##      ID word     
##   <int> <chr>    
## 1     0 cyclin   
## 2     0 dependent
## 3     0 kinases  
## 4     0 cdks     
## 5     0 regulate 
## 6     0 a
data("stop_words")
my_stopwords <- data_frame(word = c(as.character(1:100),
                                    "fig", "figure", "et", "al", "table",
                                    "data", "analysis", "analyze", "study",
                                    "method", "result", "conclusion", "author",
                                    "find", "found", "show", "perform",
                                    "demonstrate", "evaluate", "discuss"))
t1 <- t1 %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(my_stopwords, by = "word") %>%
  filter(str_detect(word, "[a-z]"))

For a first overview, we have a look at the overall most popular words and their frequencies. This is our first serious application of tidyverse and ggplot2 tools to text data

t1 %>%
  count(word) %>%
  filter(n > 5e4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

As far as I can see, tidytext has currently no native stemming function. Therefore, we will use the “SnowballC” package and its “wordStem” tool:

t1 <- t1 %>%
  mutate(word = wordStem(word))

t1 %>%
  count(word) %>%
  filter(n > 5e4) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

The result shows us the fundamental words that are most frequent in our overall text data. Another way of visualising these frequencies is through a wordcloud. Personally, I suspect that wordclouds might be the text equivalent of pie charts. But it’s useful to know how to incorporate them into tidy text analysis:

t1 %>% 
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Class-dependent word frequencies

In order to use these word frequencies for prediction we first need to determine them for the individual Classes separately. Below, we join our “text” data with the Class information in the “variants” data set. Afterwards, we determine the relative frequency by Class of each word.

In this example, we will compare Class == 7, the most frequent one, with Classes 1 and 2. Also, we will only look at words with more than 1000 occurences per Class to keep an overview. Here the ability to use dplyr tools starts to pay off properly:

foo <- train %>%
  select(ID, Class)

t1_class <- full_join(t1, foo, by = "ID")

frequency <-t1_class %>%
  count(Class, word) %>%
  filter(n > 5e2) %>%
  group_by(Class) %>%
  mutate(freq = n / sum(n)) %>% 
  select(-n) %>% 
  spread(Class, freq) %>% 
  gather(Class, freq, `1`:`2`)

Then, for a visual overview, we plot the frequency of the words in Class 7 against the other two Classes (note the logarithmic axes):

ggplot(frequency, aes(x = freq, y = `7`, color = abs(`7` - freq))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.1, height = 0.1) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  #scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray95") +
  facet_wrap(~Class, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Class 7", x = NULL)
## Warning: Removed 2355 rows containing missing values (geom_point).
## Warning: Removed 2355 rows containing missing values (geom_text).

The plots give us a useful overview. For instance, they suggest that Classes 2 and 7 are more similar than 1 and 7. For a more systematic approach we compute the correlation coefficients for each frequency set (this time for the full lists, not just above 1000 occurences):

frequency <-t1_class %>%
  count(Class, word) %>%
  #filter(n > 1e3) %>%
  group_by(Class) %>%
  mutate(freq = n / sum(n)) %>% 
  select(-n) %>% 
  spread(Class, freq)

frequency %>%
  select(-word) %>%
  cor(use="complete.obs", method="spearman") %>%
  corrplot(type="lower", method="number", diag=FALSE)

We find:

Classes 2 and 7 are in fact the most similar ones here, followed by 1 and 4 (correlation coefficients above 0.9)

Overall, the most different Class appears to be number 9, in particular compared to classes 3 and 5 (which are not overwhelming similar to each other). Let’s see what word frequency spread looks like for those combinations:

foo <- train %>%
  select(ID, Class)

t1_class <- full_join(t1, foo, by = "ID")

frequency <-t1_class %>%
  count(Class, word) %>%
  filter(n > 2e1) %>%
  group_by(Class) %>%
  mutate(freq = n / sum(n)) %>% 
  select(-n) %>% 
  spread(Class, freq) %>% 
  gather(Class, freq, `3`,`5`)

ggplot(frequency, aes(x = freq, y = `9`, color = abs(`9` - freq))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.1, height = 0.1) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  #scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray95") +
  facet_wrap(~Class, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Class 9", x = NULL)
## Warning: Removed 32835 rows containing missing values (geom_point).
## Warning: Removed 32835 rows containing missing values (geom_text).

We find:

There is significantly more of a scatter than in the previous set of plots; especially for Class 5 vs 9.

Interestingly, both “benign” and “pathogen” are more frequent in Class 3 vs 9.

TF-IDF analysis

Tidytext has the function bind_tf_idf to extract these metrics from a tidy data set that contains words and their counts per Class:

frequency <-t1_class %>%
  count(Class, word)

tf_idf <- frequency %>%
  bind_tf_idf(word, Class, n)

Let’s visualise the most characteristic words and their Class:

tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  top_n(20, tf_idf) %>%
  ggplot(aes(word, tf_idf, fill = Class)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip()

Well, that looks sufficiently technical I suppose. A quick google search reveals that “dnmt3b7” is in fact an “aberrant splice form of a DNA methyltransferase, DNMT3B7, expressed in virtually all cancer cell lines but at very low levels in normal cells.” (citation). Here it seems to be associated to Class 8.

Let’s have an overview of the most characteristic terms in each individual Class:

tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(Class) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%  
  ggplot(aes(word, tf_idf, fill = Class)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  theme(legend.position = "none") +
  facet_wrap(~ Class, ncol = 3, scales = "free") +
  coord_flip()