This can be a manual and protracted iterative process which may involve using specialised searching services, downloading abstracts, reading and filtering, secondary searching and so on, and may involve sifting many thousands of abstracts.
Often we may just want a rapid overview of the literature to help focus further reviewing.
In this vignette we demonstrate the use of R packages for large scale extraction of abstracts, and analytical techniques for identifying topics or themes in the abstracts.
The vignette is based on a number of R packages:
1. europepmc - a sophisticated tool which interacts with the Europe PMC API and provides access to additional fields.
2. adjutant - a fully fledged package with retrieval and clustering functions.
3. tidytext - a package for text mining using tidy data principles.
4. Rtsne - uses the t-SNE algorithm for data reduction and cluster visualisation.
5. dbscan - applies the HDBSCAN algorithm for data clustering.
6. myScrapers - wraps some functions built on other packages to automate the search, extraction, and filtering process.
We have “hacked” some of the functions in these packages and written additional functions to develop a workflow from searching and retrieval to analysis.
europepmc
This is a package which allows searching of EuropePMC via the API.
It can be downloaded from CRAN.
if(!require("europepmc")) install.packages("europepmc")
library(europepmc)
The main function is epmc_search, which allows us to search the site and retrieve abstracts, metadata and citation counts.
We’ll use it with the search term blockchain AND health.
head(epmc_search(params$search, limit = 10))
#> # A tibble: 6 x 28
#> id source pmid doi title authorString journalTitle issue
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3150~ MED 3150~ 10.2~ Appl~ Jin XL, Zha~ J Med Inter~ 9
#> 2 3141~ MED 3141~ 10.2~ Priv~ Jones M, Jo~ J Med Inter~ 8
#> 3 3147~ MED 3147~ 10.2~ A Bl~ Hylock RH, ~ J Med Inter~ 8
#> 4 3133~ MED 3133~ 10.2~ Clou~ Zhu X, Shi ~ J Med Inter~ 7
#> 5 3139~ MED 3139~ 10.3~ A Le~ Leeming G, ~ Front Med (~ <NA>
#> 6 3141~ MED 3141~ 10.1~ Med-~ Zhou T, Li ~ J Med Syst 9
#> # ... with 20 more variables: journalVolume <chr>, pubYear <chr>,
#> # journalIssn <chr>, pageInfo <chr>, pubType <chr>, isOpenAccess <chr>,
#> # inEPMC <chr>, inPMC <chr>, hasPDF <chr>, hasBook <chr>,
#> # citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, pmcid <chr>, hasSuppl <chr>
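Europe PMC queries can combine free text with field restrictions, so it may help to assemble the search string in code before passing it to epmc_search. The sketch below uses a hypothetical helper, build_query, which is not part of europepmc, and assumes the Europe PMC field names OPEN_ACCESS and PUB_YEAR. The epmc_search call is left as a comment because it needs network access.

```r
# Hypothetical helper (not part of europepmc): combine a free-text search
# with optional Europe PMC field restrictions.
build_query <- function(terms, open_access = FALSE, years = NULL) {
  parts <- terms
  if (open_access) {
    # Assumed field syntax for restricting to open access records
    parts <- c(parts, "OPEN_ACCESS:y")
  }
  if (!is.null(years)) {
    # Assumed field syntax for a publication year range
    parts <- c(parts, sprintf("PUB_YEAR:[%d TO %d]", min(years), max(years)))
  }
  paste(parts, collapse = " AND ")
}

query <- build_query("blockchain AND health", open_access = TRUE, years = 2016:2019)
print(query)

# With network access, the string could then be passed on, for example:
# europepmc::epmc_search(query, limit = 10)
```

Building the string separately keeps the search reproducible and makes it easy to record exactly what was sent to the API.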
This doesn’t extract the abstract text or MeSH headings (keywords). To facilitate this we have wrapped the search function in get_full_search, which is part of myScrapers.
library(tictoc)
set.seed(42)
tic()
search1 <- get_full_search(search = params$search, limit = params$limit)
toc()
#> 188.88 sec elapsed
head(search1, 20)
#> # A tibble: 20 x 32
#> id source pmid doi title authorString journalTitle issue
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 3150~ MED 3150~ 10.2~ Appl~ Jin XL, Zha~ J Med Inter~ 9
#> 2 3141~ MED 3141~ 10.2~ Priv~ Jones M, Jo~ J Med Inter~ 8
#> 3 3147~ MED 3147~ 10.2~ A Bl~ Hylock RH, ~ J Med Inter~ 8
#> 4 3133~ MED 3133~ 10.2~ Clou~ Zhu X, Shi ~ J Med Inter~ 7
#> 5 3139~ MED 3139~ 10.3~ A Le~ Leeming G, ~ Front Med (~ <NA>
#> 6 3141~ MED 3141~ 10.1~ Med-~ Zhou T, Li ~ J Med Syst 9
#> 7 3143~ MED 3143~ 10.1~ Dece~ Coelho FC, ~ Mem Inst Os~ <NA>
#> 8 3132~ MED 3132~ 10.3~ A Se~ Kim M, Park~ Sensors (Ba~ 13
#> 9 3133~ MED 3133~ 10.3~ Bloc~ Shuaib K, S~ J Pers Med 3
#> 10 3132~ MED 3132~ 10.3~ A Bl~ Rathee G, S~ Sensors (Ba~ 14
#> 11 3129~ MED 3129~ 10.3~ Bloc~ Pop C, Anta~ Sensors (Ba~ 14
#> 12 3131~ MED 3131~ 10.3~ Bloc~ Derhab A, G~ Sensors (Ba~ 14
#> 13 3122~ MED 3122~ 10.2~ The ~ Esmaeilzade~ J Med Inter~ 6
#> 14 3143~ MED 3143~ 10.3~ Poss~ Giordanengo~ Stud Health~ <NA>
#> 15 3135~ MED 3135~ 10.3~ Enab~ Fernández-C~ Sensors (Ba~ 15
#> 16 3134~ MED 3134~ 10.3~ Bloc~ Bernardi F,~ Stud Health~ <NA>
#> 17 3134~ MED 3134~ 10.3~ Movi~ Balis C, Ta~ Stud Health~ <NA>
#> 18 3134~ MED 3134~ 10.3~ Crow~ Mihelj J, Z~ Sensors (Ba~ 15
#> 19 3134~ MED 3134~ 10.3~ Bloc~ Shifrin M, ~ Stud Health~ <NA>
#> 20 3124~ MED 3124~ 10.3~ A Pe~ Cai W, Du X~ Sensors (Ba~ 12
#> # ... with 24 more variables: journalVolume <chr>, pubYear <chr>,
#> # journalIssn <chr>, pageInfo <chr>, pubType <chr>, isOpenAccess <chr>,
#> # inEPMC <chr>, inPMC <chr>, hasPDF <chr>, hasBook <chr>,
#> # citedByCount <int>, hasReferences <chr>, hasTextMinedTerms <chr>,
#> # hasDbCrossReferences <chr>, hasLabsLinks <chr>,
#> # hasTMAccessionNumbers <chr>, firstIndexDate <chr>,
#> # firstPublicationDate <chr>, pmcid <chr>, hasSuppl <chr>, name <int>,
#> # absText <list>, mesh <list>, keywords <chr>
We can see that the get_full_search function returns additional metadata such as citation counts, whether the journal is open access and whether there is a PDF available. By default, 1000 article descriptions are downloaded. It also includes MeSH headings and abstract text.
We can see how many articles are available altogether by running epmc_profile.
profile <- epmc_profile(query = params$search)
Running epmc_profile allows us to see that there are 329 articles, of which 254 are full text articles and 232 are open access.
We can easily look at annual abstract frequency, which shows the growth in publications over the last 3 years.
library(dplyr)
library(ggplot2)
# Count abstracts per publication year and plot them as a bar chart
search1 %>%
  count(pubYear) %>%
  ggplot(aes(pubYear, n)) +
  geom_col(fill = "blue") +
  labs(title = "Abstracts per year",
       subtitle = paste("Search: ", params$search)) +
  phecharts::theme_phe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Similarly, we can identify the most frequent journals.
# Keep the 20 journals with the most abstracts
journal_count <- search1 %>%
  count(journalTitle) %>%
  top_n(20, n) %>%
  arrange(desc(n))
# Plot the journals ordered by frequency
journal_count %>%
  ggplot(aes(reorder(journalTitle, n), n)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "Journal frequency") +
  phecharts::theme_phe()
Sensors (Basel) and PLoS One are the most frequent journals publishing articles on blockchain AND health.
Once we have a data frame of 329 records with abstract text, we can prepare the data for analysis. The create_corpus function is designed for this.
# Keep the key fields, drop records without an abstract, and combine title and abstract
out1 <- search1 %>%
  select(pmid, pmcid, doi, title, pubYear, citedByCount, absText, journalTitle) %>%
  filter(absText != "NULL") %>%
  mutate(text = paste(title, absText))
We will use a method exemplified in the adjutant package, which uses unsupervised machine learning to cluster similar articles and attach themes.
In this approach we undertake some natural language processing. We will build a tf-idf weighted corpus of stemmed words, reduce and cluster it with t-SNE and HDBSCAN, and label each cluster with its top terms.
The ultimate output of this analysis is a visualisation of clustered and labelled abstracts and an interactive table.
library(tidytext)
corp <- create_corpus(df = search1)
head(corp$corpus)
#> # A tibble: 6 x 6
#> pmid word n tf idf tf_idf
#> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 24505257 accumul 1 0.00943 4.36 0.0411
#> 2 24505257 agent 1 0.00943 4.14 0.0390
#> 3 24505257 amount 1 0.00943 3.04 0.0287
#> 4 24505257 analysi 1 0.00943 1.74 0.0164
#> 5 24505257 analyz 3 0.0283 2.28 0.0645
#> 6 24505257 attach 2 0.0189 4.65 0.0877
corp$corpus %>%
count(pmid)
#> # A tibble: 314 x 2
#> pmid n
#> <chr> <int>
#> 1 24505257 79
#> 2 25874694 57
#> 3 27037387 8
#> 4 27239273 44
#> 5 27240373 60
#> 6 27565509 72
#> 7 27638214 99
#> 8 27695049 76
#> 9 27768691 70
#> 10 28029119 91
#> # ... with 304 more rows
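To make the tf, idf and tf_idf columns concrete, the sketch below recomputes them in base R on a made-up corpus of three abstracts. It assumes the usual tidytext definitions: tf is a word's share of the words in one abstract, and idf is the natural log of the number of abstracts divided by the number of abstracts containing the word. Under that assumption, the idf of 4.36 in the first row above is consistent with a word that appears in 4 of the 314 abstracts. The data and variable names here are illustrative only.

```r
# Toy word counts per abstract; the pmid values and counts are made up.
counts <- data.frame(
  pmid = c("a", "a", "a", "b", "b", "c"),
  word = c("blockchain", "data", "trial", "blockchain", "secur", "blockchain"),
  n    = c(2, 1, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Term frequency: share of an abstract's words accounted for by each word.
words_per_doc <- tapply(counts$n, counts$pmid, sum)
counts$tf <- counts$n / as.numeric(words_per_doc[counts$pmid])

# Inverse document frequency: natural log of
# (number of abstracts / number of abstracts containing the word).
n_docs <- length(unique(counts$pmid))
docs_with_word <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))

# A word found in every abstract gets idf = 0 and so carries no weight.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

A word such as "blockchain", which occurs in every abstract of a blockchain search, scores zero, while a word concentrated in a few abstracts scores highly. This is what lets the clustering step separate themes.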
clust <- create_cluster(corpus = corp$corpus, minPts = params$minPts, perplexity = params$perplexity)
#> If there are small numbers of abstracts,
#> try lowering the perpexlity value to less than 30% of the number of returns
#> 9.15 sec elapsed
clust$cluster_size
#> # A tibble: 4 x 2
#> cluster n
#> <dbl> <int>
#> 1 0 47
#> 2 1 10
#> 3 2 20
#> 4 3 236
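The perplexity message above matters for small searches, because, as far as we know, Rtsne stops with an error when three times the perplexity exceeds the number of rows minus one. The sketch below shows one way to cap the value before clustering. The helper safe_perplexity is hypothetical and is not part of myScrapers, and the random matrix stands in for a document-term matrix. How create_cluster combines the two algorithms internally is not shown in this vignette, so the second half is an assumption about the general t-SNE then HDBSCAN pattern. It runs only if Rtsne and dbscan are installed.

```r
# Hypothetical helper: cap the requested perplexity so that
# 3 * perplexity <= n_docs - 1, the limit we understand Rtsne to enforce.
safe_perplexity <- function(n_docs, requested = 30) {
  max(1, min(requested, floor((n_docs - 1) / 3)))
}

safe_perplexity(314)  # enough abstracts: the requested value is kept
safe_perplexity(40)   # small search: the value is reduced

# Optional end-to-end run on random data, only if both packages are installed.
if (requireNamespace("Rtsne", quietly = TRUE) &&
    requireNamespace("dbscan", quietly = TRUE)) {
  set.seed(42)
  x <- matrix(rnorm(60 * 5), ncol = 5)  # 60 made-up "abstracts", 5 features
  # Reduce to two dimensions with t-SNE
  embedding <- Rtsne::Rtsne(x, dims = 2,
                            perplexity = safe_perplexity(nrow(x)),
                            check_duplicates = FALSE)$Y
  # Cluster the embedding; minPts is the smallest cluster size allowed
  clusters <- dbscan::hdbscan(embedding, minPts = 5)
  print(table(clusters$cluster))  # cluster 0 holds points treated as noise
}
```

In dbscan, cluster 0 denotes noise points rather than a genuine group.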
labels <- label_clusters(corp$corpus, clustering = clust$clustering, top_n = 4)
#> 0.14 sec elapsed
labels$labels
#> # A tibble: 4 x 2
#> # Groups: cluster [4]
#> cluster clus_names
#> <dbl> <chr>
#> 1 0 blockchain-technologi-result-paper-system
#> 2 1 trial-clinic-blockchain-data
#> 3 2 secur-data-blockchain-base-propos
#> 4 3 data-technologi-system-base
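The cluster names above are hyphenated strings of stemmed terms. One plausible way to build such labels, sketched below in base R, is to sum the tf-idf of each word within a cluster and join the highest scoring words. This is an assumption about the general idea, not the actual implementation of label_clusters, and the function label_one and the toy scores are made up for illustration.

```r
# Toy tf-idf scores for words in two clusters (values are made up).
scores <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  word    = c("trial", "clinic", "data", "secur", "data", "base"),
  tf_idf  = c(0.9, 0.7, 0.1, 0.8, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# Hypothetical labeller: join the top_n words ranked by summed tf-idf.
label_one <- function(df, top_n = 2) {
  total <- tapply(df$tf_idf, df$word, sum)
  ranked <- names(total)[order(total, decreasing = TRUE)]
  paste(ranked[seq_len(min(top_n, length(ranked)))], collapse = "-")
}

# Apply the labeller to each cluster separately.
cluster_labels <- vapply(split(scores, scores$cluster), label_one,
                         character(1), top_n = 2)
print(cluster_labels)
```

Because the words are stemmed before scoring, labels built this way contain stems such as "secur" and "technologi" rather than full words.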
# Plot the two-dimensional embedding, coloured by cluster status and sized by citations
p <- labels$results %>%
  left_join(search1, by = c("pmid.value" = "pmid")) %>%
  ggplot(aes(X1, X2)) +
  geom_point(aes(colour = clustered, size = citedByCount)) +
  ggrepel::geom_text_repel(data = labels$plot, aes(medX, medY, label = clus_names),
                           size = 3, colour = "red", alpha = 0.9)
p + scale_alpha_manual(values = c(1, 0)) +
  viridis::scale_color_viridis(discrete = TRUE, option = "viridis", alpha = .5,
                               begin = .8, end = .1, direction = -1) +
  phecharts::theme_phe() +
  theme(panel.background = element_rect(fill = "#ffffff")) +
  labs(subtitle = paste("Clustering: ", nrow(labels$plot), " topics"),
       title = paste("Search ", "= ", params$search))
# List the three most cited articles in each cluster, excluding cluster 0
most_cited <- labels$results %>%
  left_join(search1, by = c("pmid.value" = "pmid")) %>%
  filter(cluster != 0) %>%
  group_by(clus_names) %>%
  top_n(n = 3, citedByCount) %>%
  select(clus_names, title, pubYear, citedByCount) %>%
  ungroup() %>%
  arrange(clus_names, -citedByCount)
most_cited %>%
  formattable::formattable()
clus_names | title | pubYear | citedByCount |
---|---|---|---|
data-technologi-system-base | Opportunities and obstacles for deep learning in biology and medicine. | 2018 | 41 |
data-technologi-system-base | Blockchain distributed ledger technologies for biomedical and health care applications. | 2017 | 26 |
data-technologi-system-base | Healthcare Data Gateways: Found Healthcare Intelligence on Blockchain with Novel Privacy Risk Control. | 2016 | 19 |
secur-data-blockchain-base-propos | Secure Cloud-Based EHR System Using Attribute-Based Cryptosystem and Blockchain. | 2018 | 5 |
secur-data-blockchain-base-propos | Secure and Trustable Electronic Medical Records Sharing using Blockchain. | 2017 | 3 |
secur-data-blockchain-base-propos | Combining Cryptography with EEG Biometrics. | 2018 | 3 |
secur-data-blockchain-base-propos | Blockchain-Based Data Preservation System for Medical Data. | 2018 | 3 |
trial-clinic-blockchain-data | Blockchain technology for improving clinical research quality. | 2017 | 14 |
trial-clinic-blockchain-data | How blockchain-timestamped protocols could improve the trustworthiness of medical science. | 2016 | 4 |
trial-clinic-blockchain-data | Improving data transparency in clinical trials using blockchain smart contracts. | 2016 | 4 |
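The table lists four articles for the secur-data-blockchain-base-propos cluster even though three were requested. This is because top_n keeps ties: rows are ranked with a minimum-rank rule, and every row whose rank is within the limit is returned. The base R sketch below reproduces that behaviour with the citation counts from the table; the variable names are illustrative.

```r
# Citation counts for one cluster, copied from the table above.
cited <- c(5, 3, 3, 3)

# Minimum rank: tied values all share the best rank available to them.
rnk <- rank(-cited, ties.method = "min")
print(rnk)

# Keeping ranks <= 3 returns all four rows because three rows tie for rank 2.
keep_with_ties <- rnk <= 3
print(cited[keep_with_ties])

# Strict alternative: order by citations and take exactly three rows.
keep_strict <- order(-cited)[seq_len(min(3, length(cited)))]
print(cited[keep_strict])
```

If exactly three rows per cluster are needed, a strict selection such as the last step is one option, at the cost of dropping tied articles arbitrarily.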
We can review the commonest MeSH headings associated with each cluster tag.
# Summarise the most frequent MeSH headings per cluster label,
# ignoring headings that occur 30 or more times within a cluster
labels$results %>%
  left_join(search1, by = c("pmid.value" = "pmid")) %>%
  select(clus_names, mesh) %>%
  filter(mesh != "NULL") %>%
  unnest(mesh) %>%
  count(clus_names, mesh, sort = TRUE) %>%
  filter(n < 30) %>%
  ungroup() %>%
  group_by(clus_names) %>%
  top_n(10, n) %>%
  mutate(summary = paste(mesh, collapse = "; ")) %>%
  select(-c(mesh, n)) %>%
  distinct() %>%
  arrange(clus_names) %>%
  knitr::kable()
clus_names | summary |
---|---|
blockchain-technologi-result-paper-system | Humans; Genomics; Electronic Health Records; Genome, Human; Algorithms; American Medical Association; Commerce; Computer Security; Confidentiality; Cooperative Behavior; Food Safety; United States |
data-technologi-system-base | Computer Security; Delivery of Health Care; Electronic Health Records; Internet; Technology; Confidentiality; Information Dissemination; Privacy; Medical Informatics; Telemedicine |
secur-data-blockchain-base-propos | Computer Security; Humans; Electronic Health Records; Confidentiality; Privacy; Health Information Exchange; Information Dissemination; Algorithms; Cloud Computing; Information Storage and Retrieval; Insurance, Health; Telemedicine |
trial-clinic-blockchain-data | Humans; Clinical Trials as Topic; Computer Security; Internet; Algorithms; Confidentiality; Data Collection; Delivery of Health Care; Electronic Health Records; Information Dissemination; Medical Audit; Mobile Applications; Privacy; Proof of Concept Study; Quality Control; Quality Improvement; Research Design |
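The summary above depends on the mesh column being a list column, with one character vector of headings per article, which has to be expanded to one row per heading before counting. The base R sketch below shows that expansion and a per-cluster summary on made-up data. It mirrors the idea of the pipeline rather than its exact output, and summarise_cluster is a hypothetical helper.

```r
# Toy records with a list column of MeSH headings (values are made up).
records <- data.frame(
  clus_names = c("trial-clinic", "trial-clinic", "secur-data"),
  stringsAsFactors = FALSE
)
records$mesh <- list(
  c("Humans", "Clinical Trials as Topic"),
  c("Humans", "Computer Security"),
  c("Computer Security")
)

# One row per (cluster, heading) pair: the base R analogue of unnest(mesh).
long <- data.frame(
  clus_names = rep(records$clus_names, lengths(records$mesh)),
  mesh       = unlist(records$mesh),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order headings by frequency and collapse to one string.
summarise_cluster <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  paste(names(tab), collapse = "; ")
}

mesh_summary <- tapply(long$mesh, long$clus_names, summarise_cluster)
print(mesh_summary)
```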
We can extract systematic reviews in a similar way.
# Flag records whose keywords mention "Review" or whose abstract mentions "systematic review"
sr <- labels$results %>%
  left_join(search1, by = c("pmid.value" = "pmid")) %>%
  filter(str_detect(keywords, "Review") | str_detect(absText, "systematic review"))
table_sr <- sr %>%
  select(title, journalTitle, pubYear, clus_names, keywords, absText)
There are 7 articles flagged as reviews by this filter. These are shown in table 2.
title | journalTitle | pubYear | clus_names | keywords | absText |
---|---|---|---|---|---|
Blockchain Technology: Applications in Health Care. | Circ Cardiovasc Qual Outcomes | 2017 | blockchain-technologi-result-paper-system | c(“Humans”, “Confidentiality”, “Biomedical Technology”, “Diffusion of Innovation”, “Computer Security”, “Database Management Systems”, “Insurance Claim Review”, “Electronic Health Records”, “Administrative Claims, Healthcare”) | NULL |
(Block) Chain Reaction: A Blockchain Revolution Sweeps into Health Care, Offering the Possibility for a Much-Needed Data Solution. | IEEE Pulse | 2018 | data-technologi-system-base | c(“Humans”, “Databases, Factual”, “Insurance Claim Review”, “Electronic Health Records”) | Electronic health records may have digitized patient data, but getting that data from one clinician to another remains a huge challenge, especially since patients often have multiple doctors ordering tests, prescribing drugs, and providing treatment. Many experts now believe that blockchain technology might be just the thing to get a patient’s pertinent medical information from where it is stored to where it is needed, as well as to allow patients to easily view their own medical histories. In addition, blockchain technology might also be able to help with other aspects of health care, such as improving the insurance claim or other administrative processes within healthcare networks and making health-related population data available to biomedical researchers. |
Findings from 2017 on Health Information Management | Yearb Med Inform | 2018 | data-technologi-system-base | c(“Humans”, “Confidentiality”, “Health Policy”, “Health Records, Personal”, “Health Information Management”, “Health Information Exchange”, “Data Anonymization”) | OBJECTIVE:To summarize the recent literature and research and present a selection of the best papers published in 2017 in the field of Health Information Management (HIM) and Health Informatics. METHODS:A systematic review of the literature was performed by the two HIM section editors of the International Medical Informatics Association (IMIA) Yearbook with the help of a medical librarian. We searched bibliographic databases for HIM-related papers using both MeSH descriptors and keywords in titles and abstracts. A shortlist of 15 candidate best papers was first selected by section editors before being peer-reviewed by independent external reviewers. RESULTS:Health Information Exchange was a major theme within candidate best papers. The four papers ultimately selected as ‘Best Papers’ represent themes that include health information exchange, governance and policy issues, results of health information exchange, and methods of integrating information from multiple sources. Other articles within the candidate best papers include these themes as well as those focusing on authentication and de-identification and usability of information systems. CONCLUSIONS:The papers discussed in the HIM section of IMIA Yearbook reflect the overall theme of the 2018 edition of the Yearbook, i.e., the tension between privacy and access to information. While most of the papers focused on health information exchange, which reflects the “access” side of the equation, most of the others addressed privacy issues. This synopsis discusses these key issues at the intersection of HIM and informatics. |
Implementing Blockchains for Efficient Health Care: Systematic Review. | J Med Internet Res | 2019 | data-technologi-system-base | NULL | BACKGROUND:The decentralized nature of sensitive health information can bring about situations where timely information is unavailable, worsening health outcomes. Furthermore, as patient involvement in health care increases, there is a growing need for patients to access and control their data. Blockchain is a secure, decentralized online ledger that could be used to manage electronic health records (EHRs) efficiently, therefore with the potential to improve health outcomes by creating a conduit for interoperability. OBJECTIVE:This study aimed to perform a systematic review to assess the feasibility of blockchain as a method of managing health care records efficiently. METHODS:Reviewers identified studies via systematic searches of databases including PubMed, MEDLINE, Scopus, EMBASE, ProQuest, and Cochrane Library. Suitability for inclusion of each was assessed independently. RESULTS:Of the 71 included studies, the majority discuss potential benefits and limitations without evaluation of their effectiveness, although some systems were tested on live data. CONCLUSIONS:Blockchain could create a mechanism to manage access to EHRs stored on the cloud. Using a blockchain can increase interoperability while maintaining privacy and security of data. It contains inherent integrity and conforms to strict legal regulations. Increased interoperability would be beneficial for health outcomes. Although this technology is currently unfamiliar to most, investments into creating a sufficiently user-friendly interface and educating users on how best to take advantage of it would lead to improved health outcomes. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID):RR2-10.2196/10994. |
Comparison of blockchain platforms: a systematic review and healthcare examples. | J Am Med Inform Assoc | 2019 | data-technologi-system-base | NULL | OBJECTIVES:To introduce healthcare or biomedical blockchain applications and their underlying blockchain platforms, compare popular blockchain platforms using a systematic review method, and provide a reference for selection of a suitable blockchain platform given requirements and technical features that are common in healthcare and biomedical research applications. TARGET AUDIENCE:Healthcare or clinical informatics researchers and software engineers who would like to learn about the important technical features of different blockchain platforms to design and implement blockchain-based health informatics applications. SCOPE:Covered topics include (1) a brief introduction to healthcare or biomedical blockchain applications and the benefits to adopt blockchain; (2) a description of key features of underlying blockchain platforms in healthcare applications; (3) development of a method for systematic review of technology, based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement, to investigate blockchain platforms for healthcare and medicine applications; (4) a review of 21 healthcare-related technical features of 10 popular blockchain platforms; and (5) a discussion of findings and limitations of the review. |
Blockchain Technology in Healthcare: A Systematic Review. | Healthcare (Basel) | 2019 | data-technologi-system-base | NULL | Since blockchain was introduced through Bitcoin, research has been ongoing to extend its applications to non-financial use cases. Healthcare is one industry in which blockchain is expected to have significant impacts. Research in this area is relatively new but growing rapidly; so, health informatics researchers and practitioners are always struggling to keep pace with research progress in this area. This paper reports on a systematic review of the ongoing research in the application of blockchain technology in healthcare. The research methodology is based on the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines and a systematic mapping study process, in which a well-designed search protocol is used to search four scientific databases, to identify, extract and analyze all relevant publications. The review shows that a number of studies have proposed different use cases for the application of blockchain in healthcare; however, there is a lack of adequate prototype implementations and studies to characterize the effectiveness of these proposed use cases. The review further highlights the state-of-the-art in the development of blockchain applications for healthcare, their limitations and the areas for future research. To this end, therefore, there is still the need for more research to better understand, characterize and evaluate the utility of blockchain in healthcare. |
Design Choices and Trade-Offs in Health Care Blockchain Implementations: Systematic Review. | J Med Internet Res | 2019 | data-technologi-system-base | NULL | BACKGROUND:A blockchain is a list of records that uses cryptography to make stored data immutable; their use has recently been proposed for electronic medical record (EMR) systems. This paper details a systematic review of trade-offs in blockchain technologies that are relevant to EMRs. Trade-offs are defined as “a compromise between two desirable but incompatible features.” OBJECTIVE:This review’s primary research question was: “What are the trade-offs involved in different blockchain designs that are relevant to the creation of blockchain-based electronic medical records systems?” METHODS:Seven databases were systematically searched for relevant articles using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Papers published from January 1, 2017 to June 15, 2018 were selected. Quality assessments of papers were performed using the Risk Of Bias In Non-randomized Studies-of Interventions (ROBINS-I) tool and the Critical Assessment Skills Programme (CASP) tool. Database searches identified 2885 articles, of which 15 were ultimately included for analysis. RESULTS:A total of 17 trade-offs were identified impacting the design, development, and implementation of blockchain systems; these trade-offs are organized into themes, including business, application, data, and technology architecture. CONCLUSIONS:The key findings concluded the following: (1) multiple trade-offs can be managed adaptively to improve EMR utility; (2) multiple trade-offs involve improving the security of blockchain systems at the cost of other features, meaning EMR efficacy highly depends on data protection standards; and (3) multiple trade-offs result in improved blockchain scalability. Consideration of these trade-offs will be important to the specific environment in which electronic medical records are being developed. This review also uses its findings to suggest useful design choices for a hypothetical National Health Service blockchain. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID):RR2-10.2196/10994. |
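The review filter matches the literal strings "Review" and "systematic review", so it is case-sensitive and would miss an abstract that writes "Systematic Review". It also has to cope with missing keywords or abstracts. The base R sketch below shows a case-insensitive variant on made-up records. It is an illustration of the matching logic only, not the code used to produce the table above.

```r
# Toy records; titles, keywords and abstracts are made up.
records <- data.frame(
  title    = c("Paper A", "Paper B", "Paper C"),
  keywords = c("Humans; Review", NA, "Humans"),
  absText  = c(NA,
               "This paper reports a Systematic Review of blockchain use.",
               "We propose a new storage system."),
  stringsAsFactors = FALSE
)

# grepl returns FALSE for missing values, so records without keywords
# or abstracts are simply not flagged rather than producing NA.
is_review <- grepl("review", records$keywords, ignore.case = TRUE) |
  grepl("systematic review", records$absText, ignore.case = TRUE)

print(records$title[is_review])
```

A looser pattern finds more candidate reviews but may also pick up articles that merely mention a review, so the flagged records still need checking by eye.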
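library(rvest)
# Keep reviews that have a PMC identifier
get_pmcids <- sr %>%
  select(id, pmcid) %>%
  filter(!is.na(pmcid))
# Look up the full record for each review
details <- mutate(get_pmcids, details = map(id, epmc_details))
# Extract open access full-text links, excluding rows where url is "pdf"
full_text <- details %>%
  mutate(full_text = map(details, "ftx")) %>%
  unnest(full_text) %>%
  filter(availability == "Open access", url != "pdf") %>%
  select(id, url)
# Retrieve the text of each page and expand to one row per text element
ftxt <- mutate(full_text, ftext = map(url, get_page_text)) %>%
  unnest(ftext) %>%
  distinct()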
# summary_ftext <- ftxt %>%
# group_by(id) %>%
# mutate(col = paste(ftxt, collapse = " ")) %>%
# select(-ftext) %>%
# distinct() %>%
# mutate(summary = map(col, text_summariser, 6))
Finally we can gather all the abstracts into a single interactive table which can be searched, filtered and shared.
# Build a searchable, filterable table with CSV and Excel export buttons
labels$results %>%
  left_join(search1, by = c("pmid.value" = "pmid")) %>%
  select(cluster, clus_names, doi, title, journalTitle, pubYear, citedByCount, absText) %>%
  mutate(doi = paste0("<a href = http://google.com/search?q=", doi, ">doi</a>")) %>%
  DT::datatable(escape = FALSE, extensions = c('Responsive', 'Buttons', 'FixedHeader'),
                filter = "top",
                options = list(
                  autoWidth = TRUE,
                  columnDefs = list(list(width = '450px')),
                  dom = 'Bfrtip',
                  buttons = c('csv', 'excel'),
                  fixedHeader = TRUE)
  )
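The table code turns each DOI into a link by sending it to a web search. An alternative, sketched below, is to link straight to the DOI resolver at https://doi.org and to leave records without a DOI unlinked. The helper doi_link is hypothetical, and the sketch assumes DOIs that need no URL encoding.

```r
# Hypothetical helper: wrap a DOI in an HTML link to the doi.org resolver.
# Missing DOIs stay missing instead of becoming a link to "NA".
doi_link <- function(doi) {
  ifelse(is.na(doi),
         NA_character_,
         sprintf('<a href="https://doi.org/%s" target="_blank">doi</a>', doi))
}

links <- doi_link(c("10.1000/xyz123", NA))
print(links)
```

The resulting column can be shown in DT::datatable in the same way as above, since escape = FALSE already allows HTML in cells.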