Our Goal

This is an attempt to cluster the AGA IMIBD abstracts from DDW2020, to determine if the currently active descriptors capture what is being submitted, or if there are neglected clusters that need new descriptors.

We will approach this from two directions:

  1. Using the textmineR package, based on the example of clustering NIH abstracts found here: https://cran.r-project.org/web/packages/textmineR/vignettes/b_document_clustering.html
  2. Using the tidytext package, following the guidance in Chapter 6 (Topic Modeling) here: https://www.tidytextmining.com/topicmodeling.html (a minimal sketch of this route follows this list)
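
As a preview of the second route, here is a minimal sketch of the tidytext/topicmodels pipeline. It assumes the abstracts_raw data frame built in the next section, with abstract text in body and a document id column; k = 28 is just a placeholder matching the current descriptor count, and the seed is arbitrary.

library(dplyr)
library(tidytext)
library(topicmodels)

# one row per (document, word), with stop words removed
word_counts <- abstracts_raw %>% 
  unnest_tokens(word, body) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(id, word)

# cast to the DocumentTermMatrix that topicmodels expects, then fit LDA
abstracts_dtm <- word_counts %>% 
  cast_dtm(id, word, n)
abstracts_lda <- LDA(abstracts_dtm, k = 28, control = list(seed = 1234))

# per-topic word probabilities (beta), as in the tidytext book
tidy(abstracts_lda, matrix = "beta")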

First we will read in the data from an Excel file provided by a member of the AGA staff, Sierra Avil.

Read in Data, Take a Quick Look

# packages used throughout: readxl, janitor, dplyr, gt
library(readxl); library(janitor); library(dplyr); library(gt)

abstracts_raw <- read_excel("2020 IMIBD Abstracts_Sessioned and Scored.xlsx") %>%  
  clean_names()
abstracts_raw$temp_num <- 1:nrow(abstracts_raw) # row number for each of the 938 abstracts
names(abstracts_raw) <- c("section", "descriptor", "status", "title", "body", "temp_num")

# translate to short descriptor names
short_names <- c("AnimalModels", "Diarrhea", "AdvEvents", "ComparEff", "CrtlTrials",
                 "Cytokines", "Diagnostics",
                 "ActivityAssess", "Complications", "Epi", "Genomics", "InnAdapt",
                 "Microbiome", "NatHx", "Quality", "Psychosocial", "SpecPop",
                 "TherapMon", "StemCells", "UncontrBiol", "UncontrNonBiol", "Fibrosis",
                 "MicrobialRx", "MucInnate", "NonImm", "PedIBD", "GutMicrobiome",
                 "ViralProk")
merge_table <- abstracts_raw %>% 
  distinct(descriptor) %>% 
  mutate(short_name = short_names)

abstracts_raw <- abstracts_raw %>% 
  left_join(merge_table, by = "descriptor") 
abstracts_raw$id <- paste0(abstracts_raw$temp_num, "_", abstracts_raw$short_name)
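
For the promised quick look, glimpse() from dplyr gives a compact overview of the columns (output omitted here):

# quick look: each column's type and its first few values
glimpse(abstracts_raw)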

Table by Acceptance

How many abstracts do we have, and how many were accepted (Sessioned)?

abstracts_raw %>% 
  tabyl(status) %>% 
  adorn_pct_formatting() %>% 
  gt()
status n percent
Scored 255 27.2%
Sessioned 683 72.8%

A total of 938 abstracts were submitted, of which 683 (72.8%) were sessioned (accepted) and 255 (27.2%) were scored but not sessioned.

Variance by Descriptor

Does this vary by descriptor?
Let’s take a look.

abstracts_raw %>% 
  tabyl(short_name, status) %>% 
  adorn_totals() %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns() %>% 
  gt()

short_name Scored Sessioned
ActivityAssess 9.1% (5) 90.9% (50)
AdvEvents 35.7% (15) 64.3% (27)
AnimalModels 7.1% (1) 92.9% (13)
ComparEff 21.7% (5) 78.3% (18)
Complications 41.5% (27) 58.5% (38)
CrtlTrials 14.1% (9) 85.9% (55)
Cytokines 9.1% (3) 90.9% (30)
Diagnostics 28.2% (11) 71.8% (28)
Diarrhea 23.5% (4) 76.5% (13)
Epi 53.2% (41) 46.8% (36)
Fibrosis 11.1% (2) 88.9% (16)
Genomics 7.1% (2) 92.9% (26)
GutMicrobiome 23.1% (3) 76.9% (10)
InnAdapt 18.2% (2) 81.8% (9)
MicrobialRx 0.0% (0) 100.0% (6)
Microbiome 11.1% (3) 88.9% (24)
MucInnate 0.0% (0) 100.0% (8)
NatHx 13.6% (9) 86.4% (57)
NonImm 0.0% (0) 100.0% (5)
PedIBD 0.0% (0) 100.0% (5)
Psychosocial 20.8% (10) 79.2% (38)
Quality 34.7% (17) 65.3% (32)
SpecPop 12.8% (6) 87.2% (41)
StemCells 22.2% (2) 77.8% (7)
TherapMon 38.5% (25) 61.5% (40)
UncontrBiol 56.0% (42) 44.0% (33)
UncontrNonBiol 39.3% (11) 60.7% (17)
ViralProk 0.0% (0) 100.0% (1)
Total 27.2% (255) 72.8% (683)

It looks like it does.
Let's check with Fisher's exact test.

# janitor provides a fisher.test() method for two-way tabyls;
# with a 28 x 2 table, simulating the p-value keeps this tractable
abstracts_raw %>% 
  tabyl(short_name, status) %>% 
  fisher.test(simulate.p.value = TRUE) 
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  .
## p-value = 0.0004998
## alternative hypothesis: two.sided

Looks like acceptance rates vary significantly between descriptors (simulated p ≈ 0.0005).

Some standardization of scoring between descriptors might help.
To be addressed later.
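
In the meantime, a quick way to see which descriptors drive the difference is to sort the per-descriptor acceptance rates. A rough sketch, using the same counts as the tabyl above, just ordered:

# per-descriptor acceptance (Sessioned) rate, lowest first, to flag
# descriptors that deviate most from the 72.8% overall rate
abstracts_raw %>% 
  count(short_name, status) %>% 
  group_by(short_name) %>% 
  mutate(accept_rate = n / sum(n)) %>% 
  ungroup() %>% 
  filter(status == "Sessioned") %>% 
  arrange(accept_rate)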

Clustering with {textmineR}

We would like to cluster our abstracts: are there truly 28 distinct clusters? Or are we missing some? Or should some be merged?

Let’s take a look.

library(textmineR)

# create a document-term matrix 
dtm <- CreateDtm(doc_vec = abstracts_raw$body, # character vector of documents
                 doc_names = abstracts_raw$id, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # snowball English stopwords
                                  stopwords::stopwords(source = "smart")), # plus the SMART list - this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # turn off the status bar
                 cpus = 2) # default is all available cpus on the system
# construct the matrix of term counts to get the IDF vector
tf_mat <- TermDocFreq(dtm)

# calculate TF-IDF: transpose so terms are rows, multiply each term's
# counts by its IDF (recycled down the rows), then transpose back
tfidf <- t(dtm[ , tf_mat$term ]) * tf_mat$idf
tfidf <- t(tfidf)

# calculate cosine similarity: normalize each row to unit length,
# then take dot products between documents
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)

# convert matrix to a dist object
cdist <- as.dist(1 - csim)
# clustering with Ward's method
# lots of methods to choose from
hc <- hclust(cdist, "ward.D")

# cut the tree into 30 clusters
clustering <- cutree(hc, 30)
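
The cut at 30 clusters is a judgment call. One way to sanity-check it, sketched here with the cluster package applied to the hc and cdist objects from above, is the average silhouette width across a range of cuts; higher values indicate tighter, better-separated clusters (the 20:40 range is arbitrary):

library(cluster)

# average silhouette width for each candidate number of clusters
avg_sil <- sapply(20:40, function(k) {
  mean(silhouette(cutree(hc, k), cdist)[, "sil_width"])
})
names(avg_sil) <- 20:40
round(avg_sil, 3)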

An overplotted dendrogram of the clusters

# now plot
plot(hc, main = "Hierarchical clustering of 938 abstracts submitted to AGA IMIBD Section for DDW2020",
     ylab = "", xlab = "", yaxt = "n")

rect.hclust(hc, 30, border = "red")

# see what is in each cluster: score every word by its probability
# within the cluster minus its probability in the corpus overall
p_words <- colSums(dtm) / sum(dtm)

cluster_words <- lapply(unique(clustering), function(x){
  rows <- dtm[ clustering == x , ]
  
  # for memory's sake, drop all words that don't appear in the cluster
  rows <- rows[ , colSums(rows) > 0 ]
  
  colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})

Summary Table of clusters and top 5 words in each

# create a summary table of the top 5 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
                              size = as.numeric(table(clustering)),
                              top_words = sapply(cluster_words, function(d){
                                paste(
                                  names(d)[ order(d, decreasing = TRUE) ][ 1:5 ], 
                                  collapse = ", ")
                              }),
                              stringsAsFactors = FALSE)
cluster_summary %>% 
  gt()
cluster size top_words
1 144 mice, il, cells, expression, cell
2 113 fc, levels, endoscopic, responders, ml
3 362 ibd, patients, risk, years, disease
4 9 cdi, fmt, naat, tnf, anti
5 9 pouch, pouchitis, pre, pre_pouch, ileitis
6 28 anti, ada, tnf, anti_tnf, drug
7 4 frailty, ibd, ibd_patients, patients, frail
8 33 ifx, tdm, drug, levels, infliximab
9 50 response, week, clinical, remission, patients
10 7 eim, eims, ibd, colectomy, patients_eim
11 13 vdz, cd, clinical, therapy, months
12 11 sc, wk, vdz, iv, week
13 13 tofacitinib, octave, bid, mg, mg_bid
14 8 miri, mg, urgency, week, miri_mg
15 12 ust, pts, wk, maintenance, induction
16 14 pts, mg, upa, response, bl
17 11 relapse, patients, mes, remission, histologic
18 8 ius, bwt, activity, cd, mm
19 22 pregnancy, ibd, women, delivery, disease
20 6 ehi, endoscopic, concentration, ust, ml
21 3 hiv, patients, ibd, hiv_patients, patients_hiv
22 6 readmission, hospital, days, admission, ibd
23 6 anemia, iron, patients, costs, deficiency
24 6 psc, ibd, patients, aih, psc_ibd
25 10 opioid, ibd, cannabis, marijuana, patients
26 10 ibd, vte, patients, patients_ibd, inpatient
27 5 sleep, quality, ileostomies, deep_sleep, deep
28 4 remission, flare, patients, symptoms, reported
29 7 depression, anxiety, depression_anxiety, anxiety_depression, disease
30 4 ml, ug, ug_ml, ifx, ml_ug
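
Before the word clouds, it is worth asking how these clusters line up with the existing descriptors. A rough first look, sketched below; clustering is row-aligned with abstracts_raw because the DTM was built in row order:

# cross-tabulate discovered clusters against assigned descriptors;
# a cluster spread across many descriptors (or a descriptor split
# across many clusters) is a candidate for a new or merged descriptor
table(cluster = clustering, descriptor = abstracts_raw$short_name)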

Now look at word clouds for each cluster

# plot a word cloud of each cluster, preceded by a printed label
for (i in seq_len(nrow(cluster_summary))) {

  print(paste("cluster", i))
  wordcloud::wordcloud(words = names(cluster_words[[i]]), 
                       freq = cluster_words[[i]], 
                       max.words = 50, 
                       random.order = FALSE, 
                       colors = c("red", "purple", "blue"))
}
## [1] "cluster  1"

## [1] "cluster  2"

## [1] "cluster  3"

## [1] "cluster  4"

## [1] "cluster  5"

## [1] "cluster  6"

## [1] "cluster  7"

## [1] "cluster  8"

## [1] "cluster  9"

## [1] "cluster  10"

## [1] "cluster  11"

## [1] "cluster  12"

## [1] "cluster  13"

## [1] "cluster  14"

## [1] "cluster  15"

## [1] "cluster  16"

## [1] "cluster  17"

## [1] "cluster  18"

## [1] "cluster  19"

## [1] "cluster  20"

## [1] "cluster  21"

## [1] "cluster  22"

## [1] "cluster  23"

## [1] "cluster  24"

## [1] "cluster  25"

## [1] "cluster  26"

## [1] "cluster  27"

## [1] "cluster  28"

## [1] "cluster  29"

## [1] "cluster  30"