This article was inspired by a recent meeting to discuss public health data science. It outlines a method for rapidly searching the PubMed database and analysing the retrieved abstracts. Our aim is to assess quickly how far data science is discussed or applied in public health research and practice.
We used the RISmed package, which provides an R interface to the PubMed API, to extract and analyse the most recent 12,000 abstracts retrieved via a deliberately non-specific search strategy. We used cluster analysis (topic modelling) to group and classify the abstracts. We also searched the abstracts for the terms "data science" and "big data". We conclude that there is currently a very small literature on data science in public and population health, and that there is potentially a large research agenda to help us understand how emerging data management and analytic techniques can be applied in modern public health practice.
First, we’ll load the relevant R libraries.
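
The library-loading code itself is not shown in this post. The sketch below is an assumption: the package list is inferred from the functions called later (`topicmodels` is loaded separately further down, and `knitr` is called with `::`), so the original set may have differed.

```r
# Assumed package list, inferred from the functions called later in this post
pkgs <- c("RISmed",       # EUtilsSummary(), EUtilsGet()
          "dplyr",        # %>%, mutate(), group_by(), count(), left_join()
          "ggplot2",      # ggplot(), geom_point(), labs()
          "tidytext",     # unnest_tokens(), stop_words, cast_dtm()
          "wordcloud",    # wordcloud()
          "RColorBrewer", # brewer.pal()
          "viridis")      # scale_color_viridis()

# Report anything that still needs install.packages(), then attach the rest
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```

Checking with `requireNamespace()` first means the script reports missing packages in one message rather than stopping at the first failed `library()` call.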
Next, we’ll query the PubMed API via the RISmed package, using a deliberately broad search strategy.
res1 <- EUtilsSummary("data + science, population + health",
type = "esearch",
db = "pubmed",
datetype = "pdat",
retmax = 12000,
mindate = 2005,
maxdate = 2016)
The query sent to PubMed is ((“EPJ Data Sci”[Journal] OR (“data”[All Fields] AND “science”[All Fields]) OR “data science”[All Fields]) AND (“population”[MeSH Terms] OR “population”[All Fields] OR “population groups”[MeSH Terms] OR (“population”[All Fields] AND “groups”[All Fields]) OR “population groups”[All Fields]) AND (“health”[MeSH Terms] OR “health”[All Fields])) AND 2005[PDAT] : 2016[PDAT], which retrieves 14,344 PubMed entries. The translated query shows that “data science” is matched only as free text and as a journal name, so it is not currently a MeSH heading. I have restricted the download to 12,000 entries in the interests of time and because of limits on the PubMed API.
fetch <- EUtilsGet(res1, type = "efetch", db = "pubmed")
abstracts <- data.frame(title = fetch@ArticleTitle,
abstract = fetch@AbstractText,
journal = fetch@Title,
DOI = fetch@PMID, # despite its name, this column holds the PubMed ID (PMID)
year = fetch@YearPubmed)
## ensure abstracts are character fields (not factors)
abstracts <- abstracts %>% mutate(abstract = as.character(abstract))
abstracts %>%
head()
abstracts %>%
group_by(year) %>%
count() %>%
filter(year > 2013) %>%
ggplot(aes(year, n)) +
geom_point() +
geom_line() +
labs(title = "PubMed articles with search terms `data science` & `population health` \n2014 onwards",
y = "Articles") +
theme(plot.title = element_text(hjust = 0.5))
The total number of abstracts retrieved was 11,987.
cloud <- abstracts %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
Joining, by = "word"
cloud %>%
with(wordcloud(word, n, min.freq = 10, max.words = 1000, colors = brewer.pal(8, "Dark2"), scale = c(8, 0.3), rot.per = 0.4))
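
The bigram word cloud that follows relies on an object called `bigrams_united`, whose construction is not shown in this post. The sketch below shows one common tidytext way to build such a table. It is an assumption, not the original code: `toy_abstracts` is a made-up stand-in for the `abstracts` data frame, and `tidyr` is an extra dependency.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up stand-in for the `abstracts` data frame built earlier
toy_abstracts <- data.frame(
  abstract = c("Routine surveillance data support public health practice.",
               "Public health data science needs routine surveillance data."),
  stringsAsFactors = FALSE
)

bigrams_united <- toy_abstracts %>%
  # split each abstract into overlapping two-word sequences
  unnest_tokens(bigram, abstract, token = "ngrams", n = 2) %>%
  # separate the pair so each word can be checked against the stop-word list
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # glue the surviving pairs back together and count them
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)
```

For the real analysis, `toy_abstracts` would be replaced by `abstracts`. The result has a `bigram` column and a count column `n`, which is the shape the `wordcloud()` call expects.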
bigrams_united %>%
with(wordcloud(bigram, n, max.words = 1000, random.order = FALSE, colors = brewer.pal(9, "Set1"), scale = c(8, 0.3), rot.per = 0.4))
cloud3 <- abstracts %>%
select(journal) %>%
group_by(journal) %>%
count(sort = TRUE)
cloud3 %>%
with(wordcloud(journal, n, min.freq = 10, random.order = FALSE, max.words = 80, colors = brewer.pal(9, "Set1"), rot.per = 0.6))
g <- abstracts[grepl("data science", abstracts$abstract),]
g1 <- g$DOI %>% list
abstracts <- abstracts %>%
mutate(DOI = as.character(DOI))
abstracts[abstracts$DOI %in% g1[[1]],] %>%
select(title, journal, DOI) %>%
knitr::kable()
row | title | journal | DOI |
---|---|---|---|
49 | MO-FG-207B-01: Thorax/Lung. | Medical physics | 28048692 |
50 | MO-FG-207B-02: Breast. | Medical physics | 28048031 |
52 | MO-FG-207B-03: Brain. | Medical physics | 28047352 |
53 | MO-FG-207B-04: Respond to Therapy. | Medical physics | 28046682 |
54 | MO-FG-207B-00: State-of-the-Art in Radiomics in Radiology and Radiation Oncology. | Medical physics | 28046637 |
257 | Applying Multiple Data Collection Tools to Quantify Human Papillomavirus Vaccine Communication on Twitter. | Journal of medical Internet research | 27919863 |
4237 | Exploiting big data for critical care research. | Current opinion in critical care | 26348424 |
4985 | Spatial and temporal epidemiological analysis in the Big Data era. | Preventive veterinary medicine | 26092722 |
5953 | Strategic transformation of population studies: recommendations of the working group on epidemiology and population sciences from the National Heart, Lung, and Blood Advisory Council and Board of External Experts. | American journal of epidemiology | 25743324 |
6433 | OpenHealth Platform for Interactive Contextualization of Population Health Open Data. | AMIA ... Annual Symposium proceedings. AMIA Symposium | 26958160 |
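
Note that `grepl()` is case-sensitive by default, so a search for "data science" will not match "Data science" at the start of a sentence. The sketch below shows a case-insensitive variant on a made-up data frame; the `toy` data and its contents are hypothetical and exist only to illustrate the difference.

```r
# Made-up abstracts, for illustration only
toy <- data.frame(
  DOI = c("1", "2", "3"),
  abstract = c("Data science methods for disease surveillance.",
               "A cohort study of diet and blood pressure.",
               "We discuss data science training for analysts."),
  stringsAsFactors = FALSE
)

# Case-sensitive search, as used in this post: misses the capitalised term
hits_strict <- toy[grepl("data science", toy$abstract), ]

# Case-insensitive search: also catches "Data science"
hits_loose <- toy[grepl("data science", toy$abstract, ignore.case = TRUE), ]
```

Applied to the real `abstracts` data frame, the case-insensitive form could return more matches than the tables shown here.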
g <- abstracts[grepl("big data", abstracts$abstract),]
g1 <- g$DOI %>% list
abstracts <- abstracts %>%
mutate(DOI = as.character(DOI))
abstracts[abstracts$DOI %in% g1[[1]],] %>%
select(title, journal, DOI, year) %>%
knitr::kable()
row | title | journal | DOI | year |
---|---|---|---|---|
49 | MO-FG-207B-01: Thorax/Lung. | Medical physics | 28048692 | 2017 |
50 | MO-FG-207B-02: Breast. | Medical physics | 28048031 | 2017 |
52 | MO-FG-207B-03: Brain. | Medical physics | 28047352 | 2017 |
53 | MO-FG-207B-04: Respond to Therapy. | Medical physics | 28046682 | 2017 |
54 | MO-FG-207B-00: State-of-the-Art in Radiomics in Radiology and Radiation Oncology. | Medical physics | 28046637 | 2017 |
253 | Crowdsourcing Precision Cerebrovascular Health: Imaging and Cloud Seeding A Million Brains Initiative™. | Frontiers in medicine | 27921034 | 2016 |
658 | Clinical chemistry in higher dimensions: Machine-learning and enhanced prediction from routine clinical chemistry data. | Clinical biochemistry | 27452181 | 2016 |
1555 | Scaling up health knowledge at European level requires sharing integrated data: an approach for collection of database specification. | ClinicoEconomics and outcomes research : CEOR | 27358570 | 2016 |
1606 | Social Media and Population Health Virtual Exchange for Senior Nursing Students: An International Collaboration. | Studies in health technology and informatics | 27332439 | 2016 |
1896 | Translation in Data Mining to Advance Personalized Medicine for Health Equity. | Intelligent information management | 27195185 | 2016 |
1956 | Community Vital Signs: Taking the Pulse of the Community While Caring for Patients. | Journal of the American Board of Family Medicine : JABFM | 27170802 | 2016 |
3383 | Integration of molecular pathology, epidemiology and social science for global precision medicine. | Expert review of molecular diagnostics | 26636627 | 2015 |
3695 | Routinely collected data as a strategic resource for research: priorities for methods and workforce. | Public health research & practice | 26536502 | 2015 |
3951 | Latest developments in allergic rhinitis in Allergy for clinicians and researchers. | Allergy | 26443244 | 2015 |
4237 | Exploiting big data for critical care research. | Current opinion in critical care | 26348424 | 2015 |
4754 | NA | Journal of the American Medical Informatics Association : JAMIA | 26174867 | 2015 |
4834 | Epidemiology research in rheumatology-progress and pitfalls. | Nature reviews. Rheumatology | 26150125 | 2015 |
5953 | Strategic transformation of population studies: recommendations of the working group on epidemiology and population sciences from the National Heart, Lung, and Blood Advisory Council and Board of External Experts. | American journal of epidemiology | 25743324 | 2015 |
6212 | Using networks to combine "big data" and traditional surveillance to improve influenza predictions. | Scientific reports | 25634021 | 2015 |
7407 | Sensor, signal, and imaging informatics: big data and smart health technologies. | Yearbook of medical informatics | 25123735 | 2014 |
7408 | Big Data Usage Patterns in the Health Care Domain: A Use Case Driven Approach Applied to the Assessment of Vaccination Benefits and Risks. Contribution of the IMIA Primary Healthcare Working Group. | Yearbook of medical informatics | 25123718 | 2014 |
7593 | Big data for population-based cancer research: the integrated cancer information and surveillance system. | North Carolina medical journal | 25046092 | 2014 |
7684 | Big data in health care: using analytics to identify and manage high-risk and high-cost patients. | Health affairs (Project Hope) | 25006137 | 2014 |
9021 | Privacy-by-Design: Understanding Data Access Models for Secondary Data. | AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science | 24303251 | 2013 |
9159 | Prevention and management of noncommunicable disease: the IOC Consensus Statement, Lausanne 2013. | Clinical journal of sport medicine : official journal of the Canadian Academy of Sport Medicine | 24169298 | 2013 |
9845 | Transforming epidemiology for 21st century medicine and public health. | Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology | 23462917 | 2013 |
row | title | journal | DOI | year |
---|---|---|---|---|
257 | Applying Multiple Data Collection Tools to Quantify Human Papillomavirus Vaccine Communication on Twitter. | Journal of medical Internet research | 27919863 | 2016 |
566 | Prospective functional classification of all possible missense variants in PPARG. | Nature genetics | 27749844 | 2016 |
658 | Clinical chemistry in higher dimensions: Machine-learning and enhanced prediction from routine clinical chemistry data. | Clinical biochemistry | 27452181 | 2016 |
718 | Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision. | Journal of machine learning research : JMLR | 27746703 | 2016 |
804 | Comparison of Approaches for Heart Failure Case Identification From Electronic Health Record Data. | JAMA cardiology | 27706470 | 2016 |
868 | Assessing methods for generalizing experimental impact estimates to target populations. | Journal of research on educational effectiveness | 27668031 | 2016 |
1401 | Predicting suicides after outpatient mental health visits in the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). | Molecular psychiatry | 27431294 | 2016 |
1579 | Cardiac image modelling: Breadth and depth in heart disease. | Medical image analysis | 27349830 | 2016 |
1732 | Call for a Computer-Aided Cancer Detection and Classification Research Initiative in Oman. | Asian Pacific journal of cancer prevention : APJCP | 27268600 | 2016 |
1831 | The Importance of Computer Science for Public Health Training: An Opportunity and Call to Action. | JMIR public health and surveillance | 27227145 | 2016 |
1913 | Validating Machine Learning Algorithms for Twitter Data Against Established Measures of Suicidality. | JMIR mental health | 27185366 | 2016 |
2142 | Objective Assessment of Physical Activity: Classifiers for Public Health. | Medicine and science in sports and exercise | 27089222 | 2016 |
2176 | Do Staphylococcus epidermidis Genetic Clusters Predict Isolation Sources? | Journal of clinical microbiology | 27076664 | 2016 |
3039 | Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury. | Academic emergency medicine : official journal of the Society for Academic Emergency Medicine | 26766600 | 2016 |
3142 | A land use regression model for ambient ultrafine particles in Montreal, Canada: A comparison of linear regression and a machine learning approach. | Environmental research | 26720396 | 2016 |
3168 | Single-cell analysis of targeted transcriptome predicts drug sensitivity of single cells within human myeloma tumors. | Leukemia | 26710886 | 2015 |
3754 | Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. | PLoS computational biology | 26513245 | 2015 |
4432 | Implications of Cardiovascular Disease Risk Assessment Using the WHO/ISH Risk Prediction Charts in Rural India. | PloS one | 26287807 | 2015 |
4498 | Thirty years of artificial intelligence in medicine (AIME) conferences: A review of research themes. | Artificial intelligence in medicine | 26265491 | 2015 |
4660 | Lung necrosis and neutrophils reflect common pathways of susceptibility to Mycobacterium tuberculosis in genetically diverse, immune-competent mice. | Disease models & mechanisms | 26204894 | 2015 |
5022 | RAIRS2 a new expert system for diagnosing tuberculosis with real-world tournament selection mechanism inside artificial immune recognition system. | Medical & biological engineering & computing | 26081904 | 2015 |
5170 | Short-term Mortality Prediction for Elderly Patients Using Medicare Claims Data. | International journal of machine learning and computing | 28018571 | 2015 |
5249 | Using EHRs for Heart Failure Therapy Recommendation Using Multidimensional Patient Similarity Analytics. | Studies in health technology and informatics | 25991168 | 2015 |
5411 | Past and current use of walking measures for children with spina bifida: a systematic review. | Archives of physical medicine and rehabilitation | 25944500 | 2015 |
5511 | Mapping chemical structure-activity information of HAART-drug cocktails over complex networks of AIDS epidemiology and socioeconomic data of U.S. counties. | Bio Systems | 25916548 | 2015 |
7058 | The genetic interacting landscape of 63 candidate genes in Major Depressive Disorder: an explorative study. | BioData mining | 25279001 | 2014 |
7420 | Visualization and unsupervised predictive clustering of high-dimensional multimodal neuroimaging data. | Journal of neuroscience methods | 25117552 | 2014 |
8662 | NeuCube: a spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data. | Neural networks : the official journal of the International Neural Network Society | 24508754 | 2014 |
8865 | High-throughput neuro-imaging informatics. | Frontiers in neuroinformatics | 24381556 | 2014 |
9273 | A Machine Learning-Based Analysis of Game Data for Attention Deficit Hyperactivity Disorder Assessment. | Games for health journal | 26196929 | 2013 |
9430 | e-Labs and the stock of health method for simulating health policies. | Studies in health technology and informatics | 23920562 | 2013 |
## build a document-term matrix: one row per abstract (keyed by PMID), one column per word
abstracts1 <- abstracts %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words) %>%
count(DOI, word, sort = TRUE) %>%
cast_dtm(DOI, word, n)
Joining, by = "word"
library(topicmodels)
abs_lda <- LDA(abstracts1, k = 10, control = list(seed = 1234))
## per-topic word probabilities (beta) and per-document topic probabilities (gamma)
abs_lda_td <- tidy(abs_lda, matrix = "beta")
abs_lda_gamma <- tidy(abs_lda, matrix = "gamma")
abs_class <- abs_lda_gamma %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup() %>%
arrange(gamma)
abs_class %>%
sample_n(6)
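
The classification step above assigns each abstract to the topic with the highest gamma, the estimated share of the document that belongs to that topic. The sketch below illustrates the idea on made-up gamma values. `toy_gamma` is hypothetical, and `slice_max()` is used here as the newer dplyr counterpart of `top_n()`; with `with_ties = FALSE` it returns exactly one row per document, whereas `top_n()` keeps ties.

```r
library(dplyr)

# Made-up per-document topic probabilities, in the long format
# (document, topic, gamma) produced by tidying an LDA model
toy_gamma <- data.frame(
  document = rep(c("111", "222"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.7, 0.2, 0.1,
               0.1, 0.3, 0.6),
  stringsAsFactors = FALSE
)

# Keep, for each document, the single topic with the largest gamma
toy_class <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Here document "111" is assigned to topic 1 and document "222" to topic 3. A low winning gamma indicates an abstract that is spread across several topics, which is why sorting by gamma is a useful check.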
abstracts %>%
group_by(journal) %>%
count(sort = TRUE) %>%
filter(n >= 40) -> top40
abstracts %>%
left_join(top40) %>%
filter(!is.na(n)) %>%
rename(document = DOI) %>%
left_join(abs_class) %>%
filter(!is.na(topic)) %>%
ggplot(aes(document, factor(topic))) +
geom_jitter(aes(colour = factor(topic)), size = 1, width = 0.1) +
facet_wrap(~journal) +
theme(strip.text.x = element_text(size = 8), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text = element_text(size =10)) +
scale_color_viridis(discrete = TRUE, option = "C") +
theme(panel.background = element_rect(fill = "aliceblue")) +
labs(title = "Top 15 journals",
subtitle = "Documents by topic",
x = "Time")