Our Goal

This is an attempt to cluster the AGA IMIBD abstracts from DDW2020, to determine if the currently active descriptors capture what is being submitted, or if there are neglected clusters that need new descriptors.

We will approach this from two directions:

  1. Using the textmineR package, based on the example of clustering NIH abstracts found here: https://cran.r-project.org/web/packages/textmineR/vignettes/b_document_clustering.html
  2. Using the tidytext package, following the guidance in Chapter 6 (Topic Modeling) here: https://www.tidytextmining.com/topicmodeling.html (a minimal sketch of this route follows this list)
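
As a preview of the second route, here is a minimal sketch of the tidytext/topicmodels pipeline. It assumes the abstracts_raw data frame built in the next section, with abstract text in body and a document id column; k = 28 is just a placeholder matching the current descriptor count, and the seed is arbitrary.

library(dplyr)
library(tidytext)
library(topicmodels)

# one row per (document, word), with stop words removed
word_counts <- abstracts_raw %>% 
  unnest_tokens(word, body) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(id, word)

# cast to the DocumentTermMatrix that topicmodels expects, then fit LDA
abstracts_dtm <- word_counts %>% 
  cast_dtm(id, word, n)
abstracts_lda <- LDA(abstracts_dtm, k = 28, control = list(seed = 1234))

# per-topic word probabilities (beta), as in the tidytext book
tidy(abstracts_lda, matrix = "beta")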

First we will read in the data from an Excel file provided by a member of the AGA staff, Sierra Avil.

Read in Data, Take a Quick Look

# packages used throughout: readxl, janitor, dplyr, gt
library(readxl); library(janitor); library(dplyr); library(gt)

abstracts_raw <- read_excel("2020 IMIBD Abstracts_Sessioned and Scored.xlsx") %>%  
  clean_names()
abstracts_raw$temp_num <- 1:nrow(abstracts_raw) # row number for each of the 938 abstracts
names(abstracts_raw) <- c("section", "descriptor", "status", "title", "body", "temp_num")

# translate to short descriptor names
short_names <- c("AnimalModels", "Diarrhea", "AdvEvents", "ComparEff", "CrtlTrials",
                 "Cytokines", "Diagnostics",
                 "ActivityAssess", "Complications", "Epi", "Genomics", "InnAdapt",
                 "Microbiome", "NatHx", "Quality", "Psychosocial", "SpecPop",
                 "TherapMon", "StemCells", "UncontrBiol", "UncontrNonBiol", "Fibrosis",
                 "MicrobialRx", "MucInnate", "NonImm", "PedIBD", "GutMicrobiome",
                 "ViralProk")
merge_table <- abstracts_raw %>% 
  distinct(descriptor) %>% 
  mutate(short_name = short_names)

abstracts_raw <- abstracts_raw %>% 
  left_join(merge_table, by = "descriptor") 
abstracts_raw$id <- paste0(abstracts_raw$temp_num, "_", abstracts_raw$short_name)
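
For the promised quick look, glimpse() from dplyr gives a compact overview of the columns (output omitted here):

# quick look: each column's type and its first few values
glimpse(abstracts_raw)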

Table by Acceptance

How many abstracts do we have, and how many were accepted (Sessioned)?

abstracts_raw %>% 
  tabyl(status) %>% 
  adorn_pct_formatting() %>% 
  gt()
status n percent
Scored 255 27.2%
Sessioned 683 72.8%

A total of 938 abstracts were submitted, of which 683 (72.8%) were sessioned (accepted) and 255 (27.2%) were scored but not sessioned.

Variance by Descriptor

Does this vary by descriptor?
Let’s take a look.

abstracts_raw %>% 
  tabyl(short_name, status) %>% 
  adorn_totals() %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns() %>% 
  gt()

short_name Scored Sessioned
ActivityAssess 9.1% (5) 90.9% (50)
AdvEvents 35.7% (15) 64.3% (27)
AnimalModels 7.1% (1) 92.9% (13)
ComparEff 21.7% (5) 78.3% (18)
Complications 41.5% (27) 58.5% (38)
CrtlTrials 14.1% (9) 85.9% (55)
Cytokines 9.1% (3) 90.9% (30)
Diagnostics 28.2% (11) 71.8% (28)
Diarrhea 23.5% (4) 76.5% (13)
Epi 53.2% (41) 46.8% (36)
Fibrosis 11.1% (2) 88.9% (16)
Genomics 7.1% (2) 92.9% (26)
GutMicrobiome 23.1% (3) 76.9% (10)
InnAdapt 18.2% (2) 81.8% (9)
MicrobialRx 0.0% (0) 100.0% (6)
Microbiome 11.1% (3) 88.9% (24)
MucInnate 0.0% (0) 100.0% (8)
NatHx 13.6% (9) 86.4% (57)
NonImm 0.0% (0) 100.0% (5)
PedIBD 0.0% (0) 100.0% (5)
Psychosocial 20.8% (10) 79.2% (38)
Quality 34.7% (17) 65.3% (32)
SpecPop 12.8% (6) 87.2% (41)
StemCells 22.2% (2) 77.8% (7)
TherapMon 38.5% (25) 61.5% (40)
UncontrBiol 56.0% (42) 44.0% (33)
UncontrNonBiol 39.3% (11) 60.7% (17)
ViralProk 0.0% (0) 100.0% (1)
Total 27.2% (255) 72.8% (683)

It looks like it does.
Let's check with Fisher's exact test.

# janitor provides a fisher.test() method for two-way tabyls;
# with a 28 x 2 table, simulating the p-value keeps this tractable
abstracts_raw %>% 
  tabyl(short_name, status) %>% 
  fisher.test(simulate.p.value = TRUE) 
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  .
## p-value = 0.0004998
## alternative hypothesis: two.sided

Looks like acceptance rates vary significantly between descriptors (simulated p ≈ 0.0005).

Some standardization of scoring between descriptors might help.
To be addressed later.
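
In the meantime, a quick way to see which descriptors drive the difference is to sort the per-descriptor acceptance rates. A rough sketch, using the same counts as the tabyl above, just ordered:

# per-descriptor acceptance (Sessioned) rate, lowest first, to flag
# descriptors that deviate most from the 72.8% overall rate
abstracts_raw %>% 
  count(short_name, status) %>% 
  group_by(short_name) %>% 
  mutate(accept_rate = n / sum(n)) %>% 
  ungroup() %>% 
  filter(status == "Sessioned") %>% 
  arrange(accept_rate)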

Clustering with {textmineR}

We would like to cluster our abstracts: are there truly 28 distinct clusters? Or are we missing some? Or should some be merged?

Let’s take a look.

library(textmineR)

# create a document-term matrix 
dtm <- CreateDtm(doc_vec = abstracts_raw$body, # character vector of documents
                 doc_names = abstracts_raw$id, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # snowball English stopwords
                                  stopwords::stopwords(source = "smart")), # plus the SMART list - this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # turn off the status bar
                 cpus = 2) # default is all available cpus on the system
# construct the matrix of term counts to get the IDF vector
tf_mat <- TermDocFreq(dtm)

# calculate TF-IDF: transpose so terms are rows, multiply each term's
# counts by its IDF (recycled down the rows), then transpose back
tfidf <- t(dtm[ , tf_mat$term ]) * tf_mat$idf
tfidf <- t(tfidf)

# calculate cosine similarity: normalize each row to unit length,
# then take dot products between documents
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)

# convert matrix to a dist object
cdist <- as.dist(1 - csim)
# clustering with Ward's method
# lots of methods to choose from
hc <- hclust(cdist, "ward.D")

# cut the tree into 30 clusters
clustering <- cutree(hc, 30)
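
The cut at 30 clusters is a judgment call. One way to sanity-check it, sketched here with the cluster package applied to the hc and cdist objects from above, is the average silhouette width across a range of cuts; higher values indicate tighter, better-separated clusters (the 20:40 range is arbitrary):

library(cluster)

# average silhouette width for each candidate number of clusters
avg_sil <- sapply(20:40, function(k) {
  mean(silhouette(cutree(hc, k), cdist)[, "sil_width"])
})
names(avg_sil) <- 20:40
round(avg_sil, 3)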

An overplotted dendrogram of the clusters

# now plot
plot(hc, main = "Hierarchical clustering of 938 abstracts submitted to AGA IMIBD Section for DDW2020",
     ylab = "", xlab = "", yaxt = "n")

rect.hclust(hc, 30, border = "red")

# see what is in each cluster: score every word by its probability
# within the cluster minus its probability in the corpus overall
p_words <- colSums(dtm) / sum(dtm)

cluster_words <- lapply(unique(clustering), function(x){
  rows <- dtm[ clustering == x , ]
  
  # for memory's sake, drop all words that don't appear in the cluster
  rows <- rows[ , colSums(rows) > 0 ]
  
  colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})

Summary Table of clusters and top 5 words in each

# create a summary table of the top 5 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
                              size = as.numeric(table(clustering)),
                              top_words = sapply(cluster_words, function(d){
                                paste(
                                  names(d)[ order(d, decreasing = TRUE) ][ 1:5 ], 
                                  collapse = ", ")
                              }),
                              stringsAsFactors = FALSE)
cluster_summary %>% 
  gt()
cluster size top_words
1 144 mice, il, cells, expression, cell
2 113 fc, levels, endoscopic, responders, ml
3 362 ibd, patients, risk, years, disease
4 9 cdi, fmt, naat, tnf, anti
5 9 pouch, pouchitis, pre, pre_pouch, ileitis
6 28 anti, ada, tnf, anti_tnf, drug
7 4 frailty, ibd, ibd_patients, patients, frail
8 33 ifx, tdm, drug, levels, infliximab
9 50 response, week, clinical, remission, patients
10 7 eim, eims, ibd, colectomy, patients_eim
11 13 vdz, cd, clinical, therapy, months
12 11 sc, wk, vdz, iv, week
13 13 tofacitinib, octave, bid, mg, mg_bid
14 8 miri, mg, urgency, week, miri_mg
15 12 ust, pts, wk, maintenance, induction
16 14 pts, mg, upa, response, bl
17 11 relapse, patients, mes, remission, histologic
18 8 ius, bwt, activity, cd, mm
19 22 pregnancy, ibd, women, delivery, disease
20 6 ehi, endoscopic, concentration, ust, ml
21 3 hiv, patients, ibd, hiv_patients, patients_hiv
22 6 readmission, hospital, days, admission, ibd
23 6 anemia, iron, patients, costs, deficiency
24 6 psc, ibd, patients, aih, psc_ibd
25 10 opioid, ibd, cannabis, marijuana, patients
26 10 ibd, vte, patients, patients_ibd, inpatient
27 5 sleep, quality, ileostomies, deep_sleep, deep
28 4 remission, flare, patients, symptoms, reported
29 7 depression, anxiety, depression_anxiety, anxiety_depression, disease
30 4 ml, ug, ug_ml, ifx, ml_ug
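
Before the word clouds, it is worth asking how these clusters line up with the existing descriptors. A rough first look, sketched below; clustering is row-aligned with abstracts_raw because the DTM was built in row order:

# cross-tabulate discovered clusters against assigned descriptors;
# a cluster spread across many descriptors (or a descriptor split
# across many clusters) is a candidate for a new or merged descriptor
table(cluster = clustering, descriptor = abstracts_raw$short_name)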

Now look at word clouds for each cluster

# plot a word cloud of each cluster, preceded by a printed label
for (i in seq_len(nrow(cluster_summary))) {

  print(paste("cluster", i))
  wordcloud::wordcloud(words = names(cluster_words[[i]]), 
                       freq = cluster_words[[i]], 
                       max.words = 50, 
                       random.order = FALSE, 
                       colors = c("red", "purple", "blue"))
}
## [1] "cluster  1"

## [1] "cluster  2"

## [1] "cluster  3"

## [1] "cluster  4"

## [1] "cluster  5"

## [1] "cluster  6"

## [1] "cluster  7"

## [1] "cluster  8"

## [1] "cluster  9"

## [1] "cluster  10"

## [1] "cluster  11"

## [1] "cluster  12"

## [1] "cluster  13"

## [1] "cluster  14"

## [1] "cluster  15"

## [1] "cluster  16"

## [1] "cluster  17"

## [1] "cluster  18"

## [1] "cluster  19"

## [1] "cluster  20"

## [1] "cluster  21"

## [1] "cluster  22"

## [1] "cluster  23"

## [1] "cluster  24"

## [1] "cluster  25"

## [1] "cluster  26"

## [1] "cluster  27"

## [1] "cluster  28"

## [1] "cluster  29"

## [1] "cluster  30"