This is an attempt to cluster the AGA IMIBD abstracts from DDW2020, to determine if the currently active descriptors capture what is being submitted, or if there are neglected clusters that need new descriptors.
We will approach this from two directions
First we will read in the data from an excel file provided by a member of the AGA staff, Sierra Avil.
abstracts_raw <- read_excel("2020 IMIBD Abstracts_Sessioned and Scored.xlsx") %>%
clean_names()
abstracts_raw$temp_num <- 1:938
names(abstracts_raw) <- c("section", "descriptor", "status", "title", "body", "temp_num")
#translate to short descriptor names
short_names <- c("AnimalModels", "Diarrhea", "AdvEvents", "ComparEff", "CrtlTrials",
"Cytokines", "Diagnostics",
"ActivityAssess", "Complications", "Epi", "Genomics", "InnAdapt",
"Microbiome", "NatHx", "Quality", "Psychosocial", "SpecPop",
"TherapMon", "StemCells", "UncontrBiol", "UncontrNonBiol", "Fibrosis",
"MicrobialRx", "MucInnate", "NonImm", "PedIBD", "GutMicrobiome",
"ViralProk")
abstracts_raw %>%
distinct(descriptor) %>%
bind_cols(short_names) %>%
purrr::set_names("descriptor", "short_name") ->
merge_table
## New names:
## * NA -> ...2
abstracts_raw <- abstracts_raw %>%
left_join(merge_table)
## Joining, by = "descriptor"
abstracts_raw$id <- paste0(abstracts_raw$temp_num, "_", abstracts_raw$short_name)
How many abstracts do we have, and how many accepted (Sessioned)?
abstracts_raw %>%
tabyl(status) %>%
adorn_pct_formatting() %>%
gt()
| status | n | percent |
|---|---|---|
| Scored | 255 | 27.2% |
| Sessioned | 683 | 72.8% |
There are a total of 938 abstracts submitted, with 683 (72.8%) sessioned and 27.2% rejected.
Does this vary by descriptor?
Let’s take a look.
abstracts_raw %>%
tabyl(short_name, status) %>%
adorn_totals() %>%
adorn_percentages() %>%
adorn_pct_formatting() %>%
adorn_ns() %>%
gt()
| short_name | Scored | Sessioned |
|---|---|---|
| ActivityAssess | 9.1% (5) | 90.9% (50) |
| AdvEvents | 35.7% (15) | 64.3% (27) |
| AnimalModels | 7.1% (1) | 92.9% (13) |
| ComparEff | 21.7% (5) | 78.3% (18) |
| Complications | 41.5% (27) | 58.5% (38) |
| CrtlTrials | 14.1% (9) | 85.9% (55) |
| Cytokines | 9.1% (3) | 90.9% (30) |
| Diagnostics | 28.2% (11) | 71.8% (28) |
| Diarrhea | 23.5% (4) | 76.5% (13) |
| Epi | 53.2% (41) | 46.8% (36) |
| Fibrosis | 11.1% (2) | 88.9% (16) |
| Genomics | 7.1% (2) | 92.9% (26) |
| GutMicrobiome | 23.1% (3) | 76.9% (10) |
| InnAdapt | 18.2% (2) | 81.8% (9) |
| MicrobialRx | 0.0% (0) | 100.0% (6) |
| Microbiome | 11.1% (3) | 88.9% (24) |
| MucInnate | 0.0% (0) | 100.0% (8) |
| NatHx | 13.6% (9) | 86.4% (57) |
| NonImm | 0.0% (0) | 100.0% (5) |
| PedIBD | 0.0% (0) | 100.0% (5) |
| Psychosocial | 20.8% (10) | 79.2% (38) |
| Quality | 34.7% (17) | 65.3% (32) |
| SpecPop | 12.8% (6) | 87.2% (41) |
| StemCells | 22.2% (2) | 77.8% (7) |
| TherapMon | 38.5% (25) | 61.5% (40) |
| UncontrBiol | 56.0% (42) | 44.0% (33) |
| UncontrNonBiol | 39.3% (11) | 60.7% (17) |
| ViralProk | 0.0% (0) | 100.0% (1) |
| Total | 27.2% (255) | 72.8% (683) |
abstracts_raw %>%
tabyl(short_name, status) %>%
fisher.test(simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: .
## p-value = 0.0004998
## alternative hypothesis: two.sided
Looks like there is some significant variance between descriptors.
Some standardization of scoring between descriptors might help.
To be addressed later.
We would like to cluster our abstracts - are there truly 28 clusters? Or are we missing some? Or should some be merged?
Let’s take a look.
# create a document term matrix
dtm <- CreateDtm(doc_vec = abstracts_raw$body, # character vector of documents
doc_names = abstracts_raw$id, # document names
ngram_window = c(1, 2), # minimum and maximum n-gram length
stopword_vec = c(stopwords::stopwords("en"), # stopwords from tm
stopwords::stopwords(source = "smart")), # this is the default value
lower = TRUE, # lowercase - this is the default value
remove_punctuation = TRUE, # punctuation - this is the default
remove_numbers = TRUE, # numbers - this is the default
verbose = FALSE, # Turn off status bar for this demo
cpus = 2) # default is all available cpus on the system
# construct the matrix of term counts to get the IDF vector
tf_mat <- TermDocFreq(dtm)
# calculate TF-IDF
# TF-IDF and cosine similarity
tfidf <- t(dtm[ , tf_mat$term ]) * tf_mat$idf #transpose first
tfidf <- t(tfidf)
# calculate cosine similarity, convert to a distance
csim <- tfidf / sqrt(rowSums(tfidf * tfidf))
csim <- csim %*% t(csim)
# convert matrix to a dist object
cdist <- as.dist(1 - csim)
# clustering with Ward's method
# lots of methods to choose from
hc <- hclust(cdist, "ward.D")
# cut to 25 clusters
clustering <- cutree(hc, 30)
# now plot
plot(hc, main = "Hierarchical clustering of 938 abstracts submitted to AGA IMIBD Section for DDW2020",
ylab = "", xlab = "", yaxt = "n")
rect.hclust(hc, 30, border = "red")
# see what is in clusters
# probability difference calculation
p_words <- colSums(dtm) / sum(dtm)
cluster_words <- lapply(unique(clustering), function(x){
rows <- dtm[ clustering == x , ]
# for memory's sake, drop all words that don't appear in the cluster
rows <- rows[ , colSums(rows) > 0 ]
colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})
# make a summary table
# create a summary table of the top 5 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
size = as.numeric(table(clustering)),
top_words = sapply(cluster_words, function(d){
paste(
names(d)[ order(d, decreasing = TRUE) ][ 1:5 ],
collapse = ", ")
}),
stringsAsFactors = FALSE)
cluster_summary %>%
gt()
| cluster | size | top_words |
|---|---|---|
| 1 | 144 | mice, il, cells, expression, cell |
| 2 | 113 | fc, levels, endoscopic, responders, ml |
| 3 | 362 | ibd, patients, risk, years, disease |
| 4 | 9 | cdi, fmt, naat, tnf, anti |
| 5 | 9 | pouch, pouchitis, pre, pre_pouch, ileitis |
| 6 | 28 | anti, ada, tnf, anti_tnf, drug |
| 7 | 4 | frailty, ibd, ibd_patients, patients, frail |
| 8 | 33 | ifx, tdm, drug, levels, infliximab |
| 9 | 50 | response, week, clinical, remission, patients |
| 10 | 7 | eim, eims, ibd, colectomy, patients_eim |
| 11 | 13 | vdz, cd, clinical, therapy, months |
| 12 | 11 | sc, wk, vdz, iv, week |
| 13 | 13 | tofacitinib, octave, bid, mg, mg_bid |
| 14 | 8 | miri, mg, urgency, week, miri_mg |
| 15 | 12 | ust, pts, wk, maintenance, induction |
| 16 | 14 | pts, mg, upa, response, bl |
| 17 | 11 | relapse, patients, mes, remission, histologic |
| 18 | 8 | ius, bwt, activity, cd, mm |
| 19 | 22 | pregnancy, ibd, women, delivery, disease |
| 20 | 6 | ehi, endoscopic, concentration, ust, ml |
| 21 | 3 | hiv, patients, ibd, hiv_patients, patients_hiv |
| 22 | 6 | readmission, hospital, days, admission, ibd |
| 23 | 6 | anemia, iron, patients, costs, deficiency |
| 24 | 6 | psc, ibd, patients, aih, psc_ibd |
| 25 | 10 | opioid, ibd, cannabis, marijuana, patients |
| 26 | 10 | ibd, vte, patients, patients_ibd, inpatient |
| 27 | 5 | sleep, quality, ileostomies, deep_sleep, deep |
| 28 | 4 | remission, flare, patients, symptoms, reported |
| 29 | 7 | depression, anxiety, depression_anxiety, anxiety_depression, disease |
| 30 | 4 | ml, ug, ug_ml, ifx, ml_ug |
# plot a word cloud of each cluster
for (i in 1:dim(cluster_summary)[1]){
print(paste("cluster ", i))
wordcloud::wordcloud(words = names(cluster_words[[ i]]),
freq = cluster_words[[ i ]],
max.words = 50,
random.order = FALSE,
colors = c("red", "purple", "blue"),
main = "Top words in cluster 100")
}
## [1] "cluster 1"
## [1] "cluster 2"
## [1] "cluster 3"
## [1] "cluster 4"
## [1] "cluster 5"
## [1] "cluster 6"
## [1] "cluster 7"
## [1] "cluster 8"
## [1] "cluster 9"
## [1] "cluster 10"
## [1] "cluster 11"
## [1] "cluster 12"
## [1] "cluster 13"
## [1] "cluster 14"
## [1] "cluster 15"
## [1] "cluster 16"
## [1] "cluster 17"
## [1] "cluster 18"
## [1] "cluster 19"
## [1] "cluster 20"
## [1] "cluster 21"
## [1] "cluster 22"
## [1] "cluster 23"
## [1] "cluster 24"
## [1] "cluster 25"
## [1] "cluster 26"
## [1] "cluster 27"
## [1] "cluster 28"
## [1] "cluster 29"
## [1] "cluster 30"