Objective

Query PubMed for the prevalence of ‘Down Syndrome’ versus ‘Trisomy 21’ in article titles, and graph the results by year to detect any changing trends.

Approach

Query the PubMed UI for the search terms “down syndrome” or “trisomy 21”, and download the results as a CSV file, pubmed_result_down_t21_all.csv. These terms were pulled from the ‘index’ button in the query search; minor variants and misspellings were ignored. The terms were used in quotes, as shown, because this is the form returned by the index. Examine the file for structure, and note that the year is embedded in the Properties column (e.g., create date:2020/02/25 | first author:Zhang X). The regex (?<=date:)\\d{4} extracts the year successfully from the Properties column. Note that “downs syndrome” (with the s) is present in the dataset, but its frequency is only in the teens, so it was ignored.
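As a quick check of the year-extraction pattern, the sketch below applies it to the example Properties value quoted above. It assumes the stringr package is installed; the vector name `props` and the second, date-less value are made up for illustration and do not come from the dataset.

```r
library(stringr)

# First value: the example quoted in the text.
# Second value: hypothetical, included only to show the no-match case.
props <- c("create date:2020/02/25 | first author:Zhang X",
           "first author:Doe J")

# The lookbehind (?<=date:) positions the match right after "date:",
# and \\d{4} then captures the four-digit year that follows.
years <- as.numeric(str_extract(props, "(?<=date:)\\d{4}"))
years
# 2020 NA
```

A value with no "date:" field yields NA rather than an error, so rows with a missing year would need to be checked before plotting.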

In the R Markdown, import the file, and extract the year from the Properties column with str_extract() using the regex shown above. Use str_detect() with the appropriate regex to find variants of the two terms in the titles, and place each set of matches in an X_pref df. Add the preference as text in the pref column (later renamed Preference), and bind the dfs together. Plot the results by grouping by year and using the Preference column for the fill, which produces a stacked bar graph.
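The filter-label-bind steps described above can be tried on a toy data frame first. This is a minimal sketch, assuming dplyr and stringr are installed; the titles, years, and the names `toy`, `down_toy`, `t21_toy`, `toy_pref`, and `toy_counts` are invented for illustration and are not part of the PubMed download.

```r
suppressPackageStartupMessages(library(dplyr))
library(stringr)

# Hypothetical titles, invented for illustration only
toy <- data.frame(
  Title = c("Down syndrome and sleep",
            "Screening for trisomy 21",
            "Trisomy 21 (Down syndrome): a review",
            "Downs syndrome in adults"),
  Year  = c(1999, 2005, 2010, 2010),
  stringsAsFactors = FALSE
)

# One filtered data frame per term, each labeled with its preference
down_toy <- toy %>%
  filter(str_detect(Title, "[Dd]owns? [Ss]yndrome")) %>%
  mutate(Preference = "Down")
t21_toy <- toy %>%
  filter(str_detect(Title, "[Tt]risomy 21")) %>%
  mutate(Preference = "T21")

# Stack the two data frames and tally rows per year and preference
toy_pref   <- rbind(down_toy, t21_toy) %>% arrange(Year, Preference)
toy_counts <- count(toy_pref, Year, Preference)
toy_counts
```

Because the two filters run independently, a title that contains both terms (the third toy title) lands in both data frames, so the bound result has five rows from four titles. The same applies to the real data: the two counts are not mutually exclusive.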

#Load libraries in this first chunk
library(data.table)
suppressPackageStartupMessages(library(dplyr)) 
library(ggplot2)
library(readr)
library(stringr)
library(yaml)
# library(knitr)
# library(kableExtra)

# Load external R scripts here
# How to embed in R Markdown: < https://yihui.name/knitr/demo/externalization/ >
knitr::read_chunk('scripts/yaml_metadata.R')
# knitr::read_chunk('../functions/kable_smalldf_left.R')
#Load data in this chunk

#### Input File Names
input_file      <- 'data/pubmed_result_down_t21_all.csv'
metadata_yml    <- 'data/pubmed_result_down_t21_all.yml'

### Data Input
down_or_t21 <- read_csv(input_file)  # path defined once under "Input File Names"

Input Metadata

#### Read and Print Metadata from YAML format file ####
## Code chunk to read in yaml files and print out their contents
## The AudGenDB Science Team often uses YAML format files to record metadata about input and output files
## YAML format is both human and machine readable, and 
## has a sufficiently simple format that it can be written in a simple text editor with some knowledge of the format
## 'yamllint' in Mac OS X Terminal and Linux can be used to validate the hand-generated files

if (exists("metadata_yml")) {
    ## Ensure the necessary packages are installed, then load them.
    ## (Loading after install.packages() covers the first-run case,
    ## where a freshly installed package would otherwise not be attached.)
    for (pkg in c('yaml', 'dplyr', 'knitr', 'kableExtra')) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
      }
      library(pkg, character.only = TRUE)
    }
    
    # Input meta data and print it out
    meta_yml <- read_yaml(metadata_yml)
    meta_yml_out <- 
      data.frame( text = unlist(meta_yml)) %>% 
      mutate(Key = row.names(.)) %>% 
      rename(Value = text) %>% 
      select(Key, Value)
    knitr::kable(meta_yml_out, caption = paste0("Metadata from YAML File: ", metadata_yml)) %>%
      kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), 
                                full_width = FALSE,
                                position = "left") %>%
      kableExtra::column_spec(1, width = "11em") %>% 
      kableExtra::column_spec(2, width = "30em", border_left = T)
} else {
  print("No YAML Metadata File Exists")
}
Metadata from YAML File: data/pubmed_result_down_t21_all.yml

Key | Value
Name | pubmed_result_down_t21_all.csv
Search Terms | “down syndrome” or “trisomy 21”
Description | Searched PubMed for variants that refer to Trisomy 21 to analyze the prevalence of the terms in the article titles. The analyses can be found in the ‘Downs R Project’ in the report, “Prevalence of Down Syndrome vs. Trisomy 21 in PubMed Titles” (DownVsT21PubMed.html).
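
The way the chunk turns a YAML file into the Key/Value table above can be sketched without the file itself. The example below is an illustration only: it assumes the yaml package is installed, parses a hypothetical two-key YAML string (the real .yml file is not shown in this report, so its exact layout is an assumption), and builds the table with base R rather than the dplyr pipeline used in the chunk.

```r
library(yaml)

# Hypothetical YAML text reusing two of the keys from the table above;
# the layout of the real metadata file is assumed, not known.
yaml_text <- "
Name: pubmed_result_down_t21_all.csv
Search Terms: '\"down syndrome\" or \"trisomy 21\"'
"

# yaml.load() parses a YAML string into a named list
meta <- yaml.load(yaml_text)

# unlist() flattens the list to a named character vector;
# the names become the Key column and the values the Value column
flat    <- unlist(meta)
meta_df <- data.frame(Key = names(flat),
                      Value = unname(flat),
                      stringsAsFactors = FALSE)
meta_df
```

With nested YAML, unlist() joins the nested names with a dot, so keys in the printed table may look like `parent.child`.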

Data Analysis

# Undertake basic stats of data in this chunk
down_or_t21 <- 
  down_or_t21 %>% 
  mutate(Year = str_extract(Properties, "(?<=date:)\\d{4}")) %>% 
  mutate(Year = as.numeric(Year))
input_count <- nrow(down_or_t21)

# Filter for Articles with 'Down Syndrome' in Title
down_pref <- 
  down_or_t21 %>% 
  filter(str_detect(Title, "[Dd]owns? [Ss]yndrome")) %>% 
  mutate(pref = 'Down')
down_count <- nrow(down_pref)

# Filter for Articles with 'Trisomy 21' in Title
t21_pref <- 
  down_or_t21 %>% 
  filter(str_detect(Title, "[Tt]risomy 21")) %>% 
  mutate(pref = 'T21')
t21_count <- nrow(t21_pref)

# Bring categorized data together into one df
all_pref <- 
  rbind(down_pref, t21_pref) %>% 
  arrange(Year, pref) %>% 
  rename(Preference = pref)

Graph Results

# Plot results
all_pref %>% 
  group_by(Year) %>% 
  ggplot(aes(x = Year, fill = Preference)) + 
  geom_bar() +  # histogram stat="count"
  ggtitle("Frequency of the Terms 'Down Syndrome' Versus 'Trisomy 21'\nin the Titles of PubMed Articles by Year") +
  ylab("Number of Articles") +
  theme(plot.title = element_text(hjust = 0.5))

Conclusion

Of the 38,799 articles downloaded from PubMed with the search terms, “down syndrome” or “trisomy 21”, 9,030 articles had Down Syndrome in the title, and 1,830 had Trisomy 21 in the title.

As the graph shows, the overwhelming preference is for Down Syndrome, and if anything, that preference has grown stronger over the years.
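
To put the reported counts on a common scale, the short sketch below converts them to percentages of all downloaded articles and to a ratio between the two terms. The three input numbers are taken from the conclusion above; the variable names are made up for illustration.

```r
# Counts as reported in the conclusion
total_articles <- 38799
down_titles    <- 9030
t21_titles     <- 1830

# Share of all downloaded articles whose title contains each term
down_share <- round(100 * down_titles / total_articles, 1)
t21_share  <- round(100 * t21_titles / total_articles, 1)

# How many 'Down Syndrome' titles there are per 'Trisomy 21' title
ratio <- round(down_titles / t21_titles, 1)

c(down_share = down_share, t21_share = t21_share, ratio = ratio)
# down_share 23.3, t21_share 4.7, ratio 4.9
```

The two title counts together cover at most 10,860 of the 38,799 articles, so most of the downloaded records matched the search on something other than the title. A title containing both terms is included in both counts.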



Session Info

# This session info field shows the environment when the R script was run
# It will be hidden in documents, except for the summary above and a triangle
# Clicking on the triangle/summary statement will reveal these data in the browser
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.1.0  knitr_1.27        yaml_2.2.0        stringr_1.4.0    
## [5] readr_1.3.1       ggplot2_3.1.1     dplyr_0.8.0.1     data.table_1.12.2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2        highr_0.8         pillar_1.3.1      compiler_3.6.0   
##  [5] plyr_1.8.4        tools_3.6.0       digest_0.6.18     viridisLite_0.3.0
##  [9] evaluate_0.13     tibble_2.1.1      gtable_0.3.0      pkgconfig_2.0.2  
## [13] rlang_0.4.0       rstudioapi_0.10   xfun_0.12         xml2_1.2.0       
## [17] httr_1.4.0        withr_2.1.2       hms_0.4.2         webshot_0.5.1    
## [21] rprojroot_1.3-2   grid_3.6.0        tidyselect_0.2.5  glue_1.3.1       
## [25] R6_2.4.0          rmarkdown_2.0     purrr_0.3.2       magrittr_1.5     
## [29] backports_1.1.4   scales_1.0.0      htmltools_0.3.6   assertthat_0.2.1 
## [33] rvest_0.3.3       colorspace_1.4-1  labeling_0.3      stringi_1.4.3    
## [37] lazyeval_0.2.2    munsell_0.5.0     crayon_1.3.4
print(paste0("This R Markdown Document was run on ",format(Sys.Date(),"%d-%b-%Y")))
## [1] "This R Markdown Document was run on 05-Apr-2020"
# If the user is not inside RStudio, then this will send a notification to the OS
# NOTE: Notifier was not on CRAN in March 2019
# To get package: 
#install.packages("devtools") if devtools is not installed
#devtools::install_github("gaborcsardi/notifier")
################################################
require(notifier)
msg <- paste0(doc_title, " is done!")
notifier::notify(msg, title = "R notification", image = NULL)