This can be a manual and protracted iterative process, which may involve using specialised search services, downloading abstracts, reading and filtering, secondary searching and so on, and may mean sifting many thousands of abstracts.
Often we may just want a rapid overview of the literature to help focus further reviewing.
In this vignette we demonstrate the use of R packages for large scale extraction of abstracts, and analytical techniques for identifying topics or themes in the abstracts.
The vignette is based on a number of R packages:
1. europepmc - a sophisticated tool which interacts with the Europe PMC API and provides access to additional fields.
2. adjutant - a fully fledged package with retrieval and clustering functions.
3. tidytext - a package for text mining using tidy data principles.
4. Rtsne - uses the t-SNE algorithm for data reduction and cluster visualisation.
5. dbscan - applies the HDBSCAN algorithm for data clustering.
6. myScrapers - wraps some functions built on other packages to automate the search, extraction, and filtering process.
We have “hacked” some of the functions in these packages and written additional functions to develop a workflow from searching and retrieval to analysis.
europepmc
This is a package which allows searching of EuropePMC via the API.
It can be downloaded from CRAN.
if(!require("europepmc")) install.packages("europepmc")
library(europepmc)
The main function is epmc_search, which allows us to search the site and retrieve article records with metadata and citation counts.
We’ll use it with the search term “deep learning” AND “public health”.
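The calls in this vignette read the search term and the record limit from a `params` object, which is not defined in this section. A minimal sketch of what it might contain, assuming it is an ordinary named list (in an R Markdown document it would normally come from the YAML header); the `limit` value here is an assumption chosen to match the stated default of 1000 records:

```r
# Hypothetical stand-in for the vignette's `params` object
params <- list(
  search = '"deep learning" AND "public health"',  # quoted phrases joined by AND
  limit  = 1000                                    # assumed maximum records to fetch
)
params$search
```

Quoting each phrase keeps the words together as a phrase rather than matching them independently.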
head(epmc_search(params$search, limit = 10))
#> # A tibble: 6 x 28
#> id source pmid doi title authorString journalTitle journalVolume
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3143~ MED 3143~ 10.3~ A De~ Zhang S, Po~ Stud Health~ 264
#> 2 3145~ MED 3145~ 10.1~ Arti~ Patel UK, A~ J Neurol <NA>
#> 3 3092~ MED 3092~ 10.6~ [Art~ Lin SH, Che~ Hu Li Za Zhi 66
#> 4 3116~ MED 3116~ 10.1~ "[Ap~ Uchida M, N~ Sangyo Eise~ <NA>
#> 5 3118~ MED 3118~ 10.1~ Comp~ Soliman M, ~ Epidemics 28
#> 6 PPR9~ PPR <NA> 10.1~ Atro~ Ratul MAR, ~ <NA> <NA>
#> # ... with 20 more variables: pubYear <chr>, journalIssn <chr>,
#> # pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> # inPMC <chr>, hasPDF <chr>, hasBook <chr>, citedByCount <int>,
#> # hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, issue <chr>, pmcid <chr>, hasSuppl <chr>
This doesn’t extract the abstract text or MeSH headings (keywords). To facilitate this we have wrapped the search function into get_full_search in myScrapers.
library(tictoc)
tic()
search1 <- get_full_search(search = params$search, limit = params$limit)
toc()
#> 254.51 sec elapsed
head(search1, 20)
#> # A tibble: 20 x 32
#> id source pmid doi title authorString journalTitle journalVolume
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3143~ MED 3143~ 10.3~ A De~ Zhang S, Po~ Stud Health~ 264
#> 2 3145~ MED 3145~ 10.1~ Arti~ Patel UK, A~ J Neurol <NA>
#> 3 3092~ MED 3092~ 10.6~ [Art~ Lin SH, Che~ Hu Li Za Zhi 66
#> 4 3116~ MED 3116~ 10.1~ "[Ap~ Uchida M, N~ Sangyo Eise~ <NA>
#> 5 3118~ MED 3118~ 10.1~ Comp~ Soliman M, ~ Epidemics 28
#> 6 PPR9~ PPR <NA> 10.1~ Atro~ Ratul MAR, ~ <NA> <NA>
#> 7 3114~ MED 3114~ 10.3~ The ~ Cheon S, Ki~ Int J Envir~ 16
#> 8 3141~ MED 3141~ 10.1~ Sate~ Bruzelius E~ J Am Med In~ 26
#> 9 3112~ MED 3112~ 10.1~ Deep~ Khalighifar~ J Med Entom~ <NA>
#> 10 3114~ MED 3114~ 10.2~ Prom~ Balyen L, P~ Asia Pac J ~ 8
#> 11 3142~ MED 3142~ 10.1~ Auto~ Obeid JS, W~ BMC Med Inf~ 19
#> 12 3127~ MED 3127~ 10.1~ Mach~ Doupe P, Fa~ Value Health 22
#> 13 3119~ MED 3119~ 10.1~ Deep~ Graffy PM, ~ Br J Radiol 92
#> 14 3121~ MED 3121~ 10.3~ Dire~ Qian F, Che~ Int J Envir~ 16
#> 15 3097~ MED 3097~ 10.1~ Auto~ Graffy PM, ~ Abdom Radio~ <NA>
#> 16 3097~ MED 3097~ 10.3~ A De~ Lim J, Kim ~ Int J Envir~ 16
#> 17 3134~ MED 3134~ 10.1~ Erra~ Ruamviboons~ NPJ Digit M~ 2
#> 18 PPR9~ PPR <NA> 10.1~ Deve~ Xu J, Xu K,~ <NA> <NA>
#> 19 3140~ MED 3140~ 10.1~ Stra~ Wong TY, Sa~ Ophthalmolo~ <NA>
#> 20 3080~ MED 3080~ 10.1~ Deep~ Lee SM, Seo~ J Thorac Im~ 34
#> # ... with 24 more variables: pubYear <chr>, journalIssn <chr>,
#> # pageInfo <chr>, pubType <chr>, isOpenAccess <chr>, inEPMC <chr>,
#> # inPMC <chr>, hasPDF <chr>, hasBook <chr>, citedByCount <int>,
#> # hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, issue <chr>, pmcid <chr>, hasSuppl <chr>,
#> # name <int>, absText <list>, mesh <list>, keywords <chr>
We can see that the get_full_search function returns additional metadata such as citation counts, whether the journal is open access and whether there is a PDF available. By default, 1000 article descriptions are downloaded. It also includes MeSH headings and abstract text.
We can see how many articles are available altogether by running epmc_profile.
profile <- epmc_profile(query = params$search)
Running epmc_profile allows us to see that there are 704 articles, of which 638 are full text articles and 489 are open access.
We can easily look at annual abstract frequency - we can readily see the growth in publication frequency in the last 3 years.
search1 %>%
count(pubYear) %>%
ggplot(aes(pubYear, n)) +
geom_col(fill = "blue") +
labs(title = "Abstracts per year",
subtitle = paste("Search: ", params$search)) +
phecharts::theme_phe() +
theme(axis.text.x = element_text(angle = 45 ,hjust = 1))
Similarly, we can identify the most frequent journals.
journal_count <- search1 %>%
count(journalTitle) %>%
top_n(20) %>%
arrange(-n)
journal_count %>%
ggplot(aes(reorder(journalTitle, n), n)) +
geom_col(fill = "blue") +
coord_flip() +
labs(title = "Journal frequency") +
phecharts::theme_phe()
Int J Environ Res Public Health and PLoS One are the most frequent journals publishing articles on “deep learning” AND “public health”.
Once we have a data frame of 704 records with abstract text, we can prepare the data for analysis. The create_corpus function is designed for this.
out1 <- search1 %>%
select(pmid, pmcid ,doi, title, pubYear, citedByCount, absText, journalTitle) %>%
filter(absText != "NULL") %>%
mutate(text = paste(title, absText))
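The filter above drops records without an abstract by comparing the list column with the string "NULL". A more explicit way to express the same intent, assuming missing abstracts are stored as empty (NULL) entries of a list column, is to test the length of each entry. The toy list below is invented for illustration:

```r
# Toy list column: the second record has no abstract (assumed to be stored as NULL)
abs_list <- list("Some abstract text.", NULL, "Another abstract.")

# lengths() returns 0 for NULL entries, so this flags records that have text
has_text <- lengths(abs_list) > 0
has_text
```

Inside a dplyr pipeline the equivalent condition would be `filter(lengths(absText) > 0)`.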
We will use a method exemplified in the adjutant package, which uses unsupervised machine learning to try to cluster similar articles and attach themes.
In this approach we undertake some natural language processing. We will build a corpus of stemmed words weighted by tf-idf, cluster the abstracts, and label each cluster with its most characteristic terms.
The ultimate output of this analysis is a visualisation of clustered and labelled abstracts and an interactive table.
library(tidytext)
corp <- create_corpus(df = search1)
head(corp$corpus)
#> # A tibble: 6 x 6
#> pmid word n tf idf tf_idf
#> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 10463892 achiev 1 0.00671 1.72 0.0116
#> 2 10463892 admiss 1 0.00671 4.41 0.0296
#> 3 10463892 applic 5 0.0336 1.44 0.0482
#> 4 10463892 assess 1 0.00671 1.52 0.0102
#> 5 10463892 autumn 1 0.00671 6.49 0.0436
#> 6 10463892 bsc 1 0.00671 6.49 0.0436
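The tf, idf and tf_idf columns in the corpus can be reproduced by hand. The sketch below uses the usual definitions (the same ones tidytext's bind_tf_idf uses): tf is a word's share of the words in a document, idf is the natural log of the number of documents divided by the number of documents containing the word, and tf_idf is their product. The counts are made up, and whether create_corpus computes its columns in exactly this way is an assumption:

```r
# Toy counts: one row per (document, word) pair; the values are invented
counts <- data.frame(
  doc  = c("a", "a", "b", "b", "c"),
  word = c("deep", "health", "deep", "cancer", "deep"),
  n    = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

n_docs <- length(unique(counts$doc))

# tf: share of a document's word count taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# idf: log(number of documents / number of documents containing the word)
doc_freq <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(doc_freq[as.character(counts$word)]))

counts$tf_idf <- counts$tf * counts$idf
counts
```

A word that appears in every document ("deep" here) gets an idf of zero, which is why very common terms contribute little to the clustering.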
clust <- create_cluster(corpus = corp$corpus, minPts = 10)
#> 19.33 sec elapsed
clust$cluster_size
#> # A tibble: 14 x 2
#> cluster n
#> <dbl> <int>
#> 1 0 212
#> 2 1 10
#> 3 2 15
#> 4 3 21
#> 5 4 39
#> 6 5 26
#> 7 6 16
#> 8 7 26
#> 9 8 19
#> 10 9 19
#> 11 10 105
#> 12 11 19
#> 13 12 65
#> 14 13 69
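The internals of create_cluster are not shown here, but the packages listed at the start suggest a t-SNE reduction followed by HDBSCAN. The following is a minimal sketch of that combination on synthetic data, not the function's actual implementation. In dbscan's HDBSCAN output, cluster 0 holds the points treated as noise, which would be consistent with the large cluster 0 in the table above:

```r
library(Rtsne)
library(dbscan)

set.seed(42)

# Synthetic stand-in for a document-term matrix: three well separated groups
x <- rbind(
  matrix(rnorm(50 * 10, mean = 0),  ncol = 10),
  matrix(rnorm(50 * 10, mean = 6),  ncol = 10),
  matrix(rnorm(50 * 10, mean = 12), ncol = 10)
)

# Reduce to two dimensions with t-SNE; the embedding is returned in emb$Y
emb <- Rtsne(x, dims = 2, perplexity = 30, check_duplicates = FALSE)

# Cluster the embedding; minPts is the smallest group size treated as a cluster
cl <- hdbscan(emb$Y, minPts = 10)
table(cl$cluster)
```

Raising minPts tends to give fewer, larger clusters and push more points into the noise group.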
labels <- label_clusters(corp$corpus, clustering = clust$clustering, top_n = 4)
#> 0.63 sec elapsed
labels$labels
#> # A tibble: 14 x 2
#> # Groups: cluster [14]
#> cluster clus_names
#> <dbl> <chr>
#> 1 0 data-learn-base-studi
#> 2 1 pollut-qualiti-network-data
#> 3 2 resist-antibiot-antimicrobi-health
#> 4 3 segment-imag-convolut-neural-deep-network-method-perform-result
#> 5 4 genom-identifi-data-studi
#> 6 5 social-health-data-base
#> 7 6 drug-advers-safeti-model-base-studi
#> 8 7 clinic-model-learn-method
#> 9 8 breast-cancer-imag-base-studi
#> 10 9 diabet-retinopathi-screen-imag-patient-learn-base
#> 11 10 model-learn-data-studi
#> 12 11 ai-intellig-artifici-health-data
#> 13 12 data-health-research-develop
#> 14 13 student-educ-learn-studi
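The cluster names above look like the highest-ranked stemmed terms in each cluster joined with hyphens. A sketch of that idea in base R follows; the ranking by tf_idf, the toy data and the helper `label_one` are assumptions for illustration and may differ from what label_clusters does:

```r
# Toy term weights per cluster; the values are invented
terms <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  word    = c("pollut", "qualiti", "data", "resist", "antibiot", "health"),
  tf_idf  = c(0.9, 0.7, 0.2, 0.8, 0.6, 0.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the top_n highest-weighted words of one cluster
label_one <- function(d, top_n = 2) {
  top <- d[order(-d$tf_idf), ][seq_len(min(top_n, nrow(d))), ]
  paste(top$word, collapse = "-")
}

# Apply the helper to each cluster; the result is a named character vector
cluster_labels <- sapply(split(terms, terms$cluster), label_one)
cluster_labels
```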
p <- labels$results %>%
left_join(search1, by = c("pmid.value" = "pmid")) %>%
ggplot(aes(X1, X2)) +
geom_point(aes(colour = clustered, size = citedByCount) ) +
ggrepel::geom_text_repel(data = labels$plot, aes(medX, medY, label = clus_names), size = 3, colour = "#006d2c", alpha = 0.9)
p + scale_alpha_manual(values=c(1,0)) +
viridis::scale_color_viridis(discrete = TRUE, option = "cividis", alpha = .6) +
phecharts::theme_phe() +
theme(panel.background = element_rect(fill = "#f0f0f0")) +
labs(subtitle = paste("Clustering: ", nrow(labels$plot), " topics" ),
title = paste("Search ", "= ", params$search ))
most_cited <- labels$results %>%
left_join(search1, by = c("pmid.value" = "pmid")) %>%
filter(cluster !=0) %>%
group_by(clus_names) %>%
top_n(n = 3, citedByCount) %>%
select(clus_names, title, pubYear, citedByCount) %>%
ungroup() %>%
arrange(clus_names, -citedByCount)
most_cited %>%
formattable::formattable()
clus_names | title | pubYear | citedByCount |
---|---|---|---|
ai-intellig-artifici-health-data | Artificial intelligence in cancer imaging: Clinical challenges and applications. | 2019 | 4 |
ai-intellig-artifici-health-data | Global Evolution of Research in Artificial Intelligence in Health and Medicine: A Bibliometric Study. | 2019 | 3 |
ai-intellig-artifici-health-data | Cognitive computing and eScience in health and life science research: artificial intelligence and obesity intervention programs. | 2017 | 2 |
breast-cancer-imag-base-studi | Deep learning based tissue analysis predicts outcome in colorectal cancer. | 2018 | 21 |
breast-cancer-imag-base-studi | Antibody-supervised deep learning for quantification of tumor-infiltrating immune cells in hematoxylin and eosin stained breast cancer samples. | 2016 | 12 |
breast-cancer-imag-base-studi | Mammographic density and structural features can individually and jointly contribute to breast cancer risk assessment in mammography screening: a case-control study. | 2016 | 7 |
clinic-model-learn-method | Deep Artificial Neural Networks and Neuromorphic Chips for Big Data Analysis: Pharmaceutical and Bioinformatics Applications. | 2016 | 11 |
clinic-model-learn-method | Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. | 2018 | 8 |
clinic-model-learn-method | EliIE: An open-source information extraction system for clinical trial eligibility criteria. | 2017 | 7 |
data-health-research-develop | Quality collaboratives: lessons from research. | 2002 | 231 |
data-health-research-develop | Building better biomarkers: brain models in translational neuroimaging. | 2017 | 72 |
data-health-research-develop | Making sense of big data in health research: Towards an EU action plan. | 2016 | 44 |
diabet-retinopathi-screen-imag-patient-learn-base | Improved Automated Detection of Diabetic Retinopathy on a Publicly Available Dataset Through Integration of Deep Learning. | 2016 | 48 |
diabet-retinopathi-screen-imag-patient-learn-base | Retinal Imaging Techniques for Diabetic Retinopathy Screening. | 2016 | 9 |
diabet-retinopathi-screen-imag-patient-learn-base | Multi-categorical deep learning neural network to classify retinal images: A pilot study employing small database. | 2017 | 7 |
drug-advers-safeti-model-base-studi | Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. | 2015 | 63 |
drug-advers-safeti-model-base-studi | Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. | 2016 | 17 |
drug-advers-safeti-model-base-studi | Natural Products for Drug Discovery in the 21st Century: Innovations for Novel Drug Discovery. | 2018 | 13 |
genom-identifi-data-studi | Comprehensive functional genomic resource and integrative model for the human brain. | 2018 | 12 |
genom-identifi-data-studi | Pleiotropic Mechanisms Indicated for Sex Differences in Autism. | 2016 | 9 |
genom-identifi-data-studi | Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia, and bipolar disorder. | 2018 | 9 |
model-learn-data-studi | Deep learning for neuroimaging: a validation study. | 2014 | 53 |
model-learn-data-studi | Forecasting influenza in Hong Kong with Google search queries and statistical model fusion. | 2017 | 11 |
model-learn-data-studi | Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. | 2017 | 10 |
pollut-qualiti-network-data | Design of a Mobile Low-Cost Sensor Network Using Urban Buses for Real-Time Ubiquitous Noise Monitoring. | 2016 | 6 |
pollut-qualiti-network-data | A systematic review of data mining and machine learning for air pollution epidemiology. | 2017 | 6 |
pollut-qualiti-network-data | Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation. | 2017 | 5 |
pollut-qualiti-network-data | Towards Personal Exposures: How Technology Is Changing Air Pollution and Health Research. | 2017 | 5 |
resist-antibiot-antimicrobi-health | DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. | 2018 | 14 |
resist-antibiot-antimicrobi-health | Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. | 2018 | 6 |
resist-antibiot-antimicrobi-health | Myxinidin2 and myxinidin3 suppress inflammatory responses through STAT3 and MAPKs to promote wound healing. | 2017 | 4 |
resist-antibiot-antimicrobi-health | Using Machine Learning To Predict Antimicrobial MICs and Associated Genomic Features for Nontyphoidal Salmonella. | 2019 | 4 |
segment-imag-convolut-neural-deep-network-method-perform-result | Urinary bladder segmentation in CT urography using deep-learning convolutional neural network and level sets. | 2016 | 29 |
segment-imag-convolut-neural-deep-network-method-perform-result | ISLES 2015 - A public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. | 2017 | 24 |
segment-imag-convolut-neural-deep-network-method-perform-result | Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. | 2018 | 12 |
social-health-data-base | The use of social networking platforms for sexual health promotion: identifying key strategies for successful user engagement. | 2015 | 16 |
social-health-data-base | Researching Mental Health Disorders in the Era of Social Media: Systematic Review. | 2017 | 14 |
social-health-data-base | Characterizing the Discussion of Antibiotics in the Twittersphere: What is the Bigger Picture? | 2015 | 13 |
student-educ-learn-studi | Clinical experience, performance in final examinations, and learning style in medical students: prospective study. | 1998 | 93 |
student-educ-learn-studi | Intercalated degrees, learning styles, and career preferences: prospective longitudinal study of UK medical students. | 1999 | 66 |
student-educ-learn-studi | Randomised controlled trial of clinical decision support tools to improve learning of evidence based medicine in medical students. | 2003 | 62 |
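Some clusters in the table list four articles rather than three. This is what top_n does when values are tied: it keeps every row that ties for the last place. The sketch below shows the behaviour on invented data and one way to force exactly three rows with slice_max (available in dplyr 1.0.0 and later):

```r
library(dplyr)

# Toy cluster with tied citation counts
toy <- tibble(
  clus_names   = rep("a", 4),
  citedByCount = c(6, 6, 5, 5)
)

# top_n keeps ties, so asking for 3 rows returns 4 here
with_ties <- toy %>%
  group_by(clus_names) %>%
  top_n(n = 3, citedByCount) %>%
  ungroup()

# slice_max with with_ties = FALSE returns exactly 3 rows
no_ties <- toy %>%
  group_by(clus_names) %>%
  slice_max(citedByCount, n = 3, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)
nrow(no_ties)
```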
We can review the commonest MeSH headings associated with each cluster tag.
labels$results %>%
left_join(search1, by = c("pmid.value" = "pmid")) %>%
select(clus_names, mesh) %>%
filter(mesh != "NULL") %>%
unnest(mesh) %>%
count(clus_names, mesh,sort = TRUE) %>%
filter(n < 30) %>%
ungroup() %>%
group_by(clus_names) %>%
top_n(10) %>%
mutate(summary = paste(mesh, collapse = "; " )) %>%
select(-c(mesh, n)) %>%
distinct() %>%
arrange(clus_names) %>%
knitr::kable()
clus_names | summary |
---|---|
ai-intellig-artifici-health-data | Artificial Intelligence; Big Data; Public Health |
breast-cancer-imag-base-studi | Humans; Breast Neoplasms; Female; Middle Aged; Aged; Breast; Machine Learning; Mammography; Retrospective Studies; Adult; Aged, 80 and over; Algorithms; Breast Density; Deep Learning; Early Detection of Cancer; Image Interpretation, Computer-Assisted; Image Processing, Computer-Assisted; Magnetic Resonance Imaging; Male; Neoplasms; Risk Assessment; ROC Curve; Sensitivity and Specificity; Ultrasonography, Mammary |
clinic-model-learn-method | Humans; Algorithms; Electronic Health Records; Natural Language Processing; Machine Learning; Neural Networks (Computer); Datasets as Topic; International Classification of Diseases; Artificial Intelligence; Bayes Theorem; Phenotype |
data-health-research-develop | Public Health; Data Mining; Databases, Factual; Delivery of Health Care; Medical Informatics; Artificial Intelligence; Biomedical Research; Electronic Health Records; Machine Learning; Translational Medical Research |
data-learn-base-studi | Female; Machine Learning; Male; Algorithms; Deep Learning; Middle Aged; Neural Networks (Computer); Aged; Image Processing, Computer-Assisted; Adult; Tomography, X-Ray Computed |
diabet-retinopathi-screen-imag-patient-learn-base | Humans; Diabetic Retinopathy; Female; Male; Aged; Aged, 80 and over; Middle Aged; Retina; Adult; Cross-Sectional Studies; Diagnosis, Computer-Assisted; Diagnostic Techniques, Ophthalmological; Image Processing, Computer-Assisted; Neural Networks (Computer); Reproducibility of Results; ROC Curve; Young Adult |
drug-advers-safeti-model-base-studi | Humans; Artificial Intelligence; Data Mining; Drug-Related Side Effects and Adverse Reactions; Neural Networks (Computer); Social Media; Area Under Curve; Automation, Laboratory; Back Pain; Biological Products; Computational Biology; Computer Simulation; Databases as Topic; Deep Learning; Drug Design; Drug Discovery; Drug Industry; Drug Interactions; Information Storage and Retrieval; Models, Chemical; Models, Theoretical; Natural Language Processing; Necrosis; Pharmacovigilance; Phytotherapy; Plants, Medicinal; Programming Languages; Publications; Robotics; Semantics; Software; Supervised Machine Learning |
genom-identifi-data-studi | Humans; Genome-Wide Association Study; Computational Biology; Genetic Predisposition to Disease; Algorithms; Databases, Genetic; Deep Learning; Female; Genome, Human; Genomics; Polymorphism, Single Nucleotide |
model-learn-data-studi | Machine Learning; Female; Neural Networks (Computer); Male; Algorithms; Deep Learning; Adult; Middle Aged; Aged; China; Prognosis |
pollut-qualiti-network-data | Air Pollution; Air Pollutants; Environmental Monitoring; Humans; Neural Networks (Computer); Cities; Forecasting; Algorithms; Automation; Beijing; Data Mining; Deep Learning; Electroencephalography; Electrooculography; Environmental Exposure; Epidemiologic Studies; Hong Kong; Inventions; Machine Learning; Models, Statistical; Models, Theoretical; Particulate Matter; Polysomnography; Sleep Stages; Sleep Wake Disorders; Smartphone |
resist-antibiot-antimicrobi-health | Humans; Anti-Bacterial Agents; Drug Resistance, Multiple, Bacterial; Machine Learning; Microbial Sensitivity Tests; Antimicrobial Cationic Peptides; Biofilms; Cell Membrane; DNA, Bacterial; Genome, Bacterial; High-Throughput Nucleotide Sequencing; Lipopolysaccharides; Sequence Analysis, DNA; Whole Genome Sequencing |
segment-imag-convolut-neural-deep-network-method-perform-result | Humans; Female; Male; Algorithms; Image Processing, Computer-Assisted; Middle Aged; Neural Networks (Computer); Adult; Magnetic Resonance Imaging; Aged; Deep Learning; Image Interpretation, Computer-Assisted; Young Adult |
social-health-data-base | Humans; Social Media; Neural Networks (Computer); Machine Learning; Adolescent; Adult; Algorithms; Analgesics, Opioid; Deep Learning; Female; Internet; Male; Middle Aged; Public Opinion; Young Adult |
student-educ-learn-studi | Female; Male; Curriculum; Students, Medical; Educational Measurement; Learning; Adult; Education, Medical, Undergraduate; Problem-Based Learning; Young Adult |
Let's explore articles for which Public Health is a MeSH heading.
ph <- labels$results %>%
left_join(search1, by = c("pmid.value" = "pmid")) %>%
filter(str_detect(keywords, "Public Health"))
ph %>%
count(clus_names, sort = TRUE)
#> # A tibble: 4 x 2
#> clus_names n
#> <chr> <int>
#> 1 data-health-research-develop 7
#> 2 model-learn-data-studi 2
#> 3 student-educ-learn-studi 2
#> 4 ai-intellig-artifici-health-data 1
There is one article tagged with ai-intellig-artifici-health-data which has Public Health as a MeSH heading. We can use epmc_ftxt to extract the full text article.
library(rvest)
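get_pmcids <- ph %>%
# cluster label as it appears in the tables above
filter(clus_names == "data-health-research-develop") %>%
select(id, pmcid) %>%
filter(!is.na(pmcid))
# Assumption: ids is a name/value tibble of the article ids, which gives the `value` column joined on below
ids <- enframe(pull(get_pmcids, id))
details <- mutate(ids, details = map(value, epmc_details))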
full_text <- details %>%
mutate(full_text = map(details, "ftx")) %>%
unnest(full_text) %>%
filter(availability == "Free") %>%
left_join(get_pmcids, by = c("value" = "id")) %>%
distinct()
full_text <- europepmc::epmc_ftxt("PMC5171550")
ft <- full_text %>%
html_text()
ft %>%
str_split(., "\\. ") %>%
enframe() %>%
formattable::formattable()
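Splitting on ". " removes the full stop from each sentence and does not split on question or exclamation marks. An alternative sketch in base R splits on the whitespace that follows sentence-ending punctuation, so the punctuation is kept. The sample text is invented, and abbreviations such as "e.g. " would still be split incorrectly:

```r
txt <- "Deep learning is popular. Is it useful? Yes."

# Split on whitespace preceded by ., ! or ? (lookbehind requires perl = TRUE)
sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
sentences
```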
Finally we can gather all the abstracts into a single interactive table which can be searched, filtered and shared.
labels$results %>%
left_join(search1, by = c("pmid.value" = "pmid")) %>%
select(cluster, clus_names, doi, title, journalTitle, pubYear, citedByCount, absText) %>%
mutate(doi = paste0("<a href='https://doi.org/", doi, "'>doi</a>")) %>%
DT::datatable(escape = FALSE, extensions = c('Responsive','Buttons', 'FixedHeader'),
filter = "top",
options = list(
autoWidth = TRUE,
columnDefs = list(list(width = '450px')),
dom = 'Bfrtip',
buttons = c('csv', 'excel'),
fixedHeader=TRUE)
)
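Records without a DOI would otherwise produce a link to a non-existent address. A small sketch of a link builder that resolves DOIs through doi.org and leaves missing values as NA; `doi_link` is a hypothetical helper and not part of any of the packages used here:

```r
# Hypothetical helper: build an HTML link for each DOI, keeping NA for missing ones
doi_link <- function(doi) {
  ifelse(
    is.na(doi),
    NA_character_,
    sprintf('<a href="https://doi.org/%s">doi</a>', doi)
  )
}

links <- doi_link(c("10.1000/xyz", NA))
links
```

It could be used in the pipeline above as `mutate(doi = doi_link(doi))`.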