Introduction

This is just a little exercise showing the power of dplyr for data munging in the context of a realistic biological relational database.

Get metadata sqlite file

The first step is to load the GEOmetadb library which provides the very light infrastructure for the separate SQLite database. After loading the library, retrieve the SQLite database file–updated weekly–for local queries against the NCBI GEO metadata.

library(GEOmetadb)
sfile = 'GEOmetadb.sqlite'
if(!file.exists(sfile)) {
  sfile = getSQLiteFile() 
}

Set up dplyr stuff….

Just for fun, instead of using SQL queries, we can rely on dplyr to do the SQL for us.

library(dplyr)
gmdb = src_sqlite(sfile)
# List available tables in the database
src_tbls(gmdb)
##  [1] "gds"               "gds_subset"        "geoConvert"       
##  [4] "geodb_column_desc" "gpl"               "gse"              
##  [7] "gse_gpl"           "gse_gsm"           "gsm"              
## [10] "metaInfo"          "sMatrix"
tgse = tbl(gmdb,'gse')
tgsm = tbl(gmdb,'gsm')
tgpl = tbl(gmdb,'gpl')

Queries of interest

For the purposes of this little exercise, we do three sets of queries:

Simple table summaries

tablestats = data.frame(entity=c('Series','Samples','Platforms'),
                        Count=c(nrow(tgse),nrow(tgsm),nrow(tgpl)))
pander(tablestats,justify=c('left','right'))
entity Count
Series 56976
Samples 1370122
Platforms 14291

Multiplatform datasets by gpl

One way of measuring multiplatform GEO Series in GEO is to map the GSE identifiers to their associated GPL identifiers. This probably overestimates the actual complexity of the data, as some GEO series map to multiple arrays of the same technology. For example, there is a GEO Series, GSE1097, that has 153 platforms, but this was a set of whole-genome tiling arrays.

tgse_gpl = tbl(gmdb,'gse_gpl')
gse_gpl_count = select(tgse_gpl,gse) %>% 
  group_by(gse) %>%
  summarise(count=n()) %>%
  filter(count>1) %>%
  collect()
# and make a cut base on number of platforms associated with a GSE
cut_count = cut(gse_gpl_count$count,
                breaks=c(2,3,5,10,20,max(gse_gpl_count$count)))
table(cut_count)
## cut_count
##    (2,3]    (3,5]   (5,10]  (10,20] (20,153] 
##     1046      493      177       73       31

And a small barplot with the binned number of GPLs per GSE:

barplot(table(cut_count),
        main="GEO Series by Number of GEO Platforms",
        sub="Includes only GEO Series with more than one platform")

Multiplatform based on assay type

Another way to quantify “multiplatform” is to look at the different assay types in a GSE. This may underestimate the number somewhat, but it is probably closer to what we want to capture. The type column in the GSE table includes a semicolon-separated list of experiment types. While perhaps not a perfect list, it at least allows us to quantify the number of experiments of each type. The number of each type is given in the next table.

library(tidyr)
gse_type = select(tgse,gse,type) %>%
  transform(type = strsplit(type,';\\t')) %>%
  unnest(type) 
type_count = select(gse_type,type) %>%
  group_by(type) %>%
  summarize(count=n()) %>% 
  arrange(desc(count))
pander(type_count,justify=c('left','right'))
type count
Expression profiling by array 39707
Expression profiling by high throughput sequencing 4478
Genome binding/occupancy profiling by high throughput sequencing 3798
Genome binding/occupancy profiling by genome tiling array 2105
Non-coding RNA profiling by array 2100
Non-coding RNA profiling by high throughput sequencing 1423
Other 1096
Genome variation profiling by genome tiling array 1055
Genome variation profiling by SNP array 801
Methylation profiling by high throughput sequencing 730
Methylation profiling by genome tiling array 678
Expression profiling by genome tiling array 634
Genome variation profiling by array 593
Methylation profiling by array 542
SNP genotyping by SNP array 502
Expression profiling by RT-PCR 316
Expression profiling by SAGE 242
Genome binding/occupancy profiling by array 169
Protein profiling by protein array 166
Third-party reanalysis 130
Non-coding RNA profiling by genome tiling array 106
Genome variation profiling by high throughput sequencing 62
Expression profiling by MPSS 20
Expression profiling by SNP array 13
Genome binding/occupancy profiling by SNP array 12
Methylation profiling by SNP array 9
Protein profiling by Mass Spec 6
Genome binding/occupancy profiling by high throughput sequen 1

We can summarize the number of different types per GEO Series. The results are given in the next table.

gse_by_type_count = select(gse_type,gse) %>%
  group_by(gse) %>%
  summarize(NumberOfDataTypes=n()) %>%
  group_by(NumberOfDataTypes) %>% 
  summarize(NumberOfGEOSeries=n())
pander(gse_by_type_count,justify=c('right','right'))
NumberOfDataTypes NumberOfGEOSeries
1 53066
2 3382
3 456
4 64
5 8

Data growth over time

I have summarized the data by profiling type and year of submission. To keep the data clean, I have included only the 6 most common GSE types for the plot. Feel free to adjust.

library(lubridate)
gse_type_year = select(tgse,gse,type,year=submission_date) %>%
  transform(year=year(as.Date(year))) %>%
  filter(year<2015) %>%
  transform(type = strsplit(type,';\\t')) %>%
  unnest(type) %>% 
  right_join(type_count[1:6,'type'])

The following plot shows submissions per year (not cumulative number of submissions) cut off at 2014. There are clearly trends toward sequencing over microarrays, but microarray expression profiling still greatly exceeds sequencing–note y-scale is log10.

library(ggplot2)
group_by(gse_type_year,type,year) %>%
  summarize(submissions=n()) %>%
  ggplot(aes(x=year,y=submissions,group=type,color=type)) +
  geom_line(aes(linetype=type),size=1.3) + 
  scale_y_log10() + 
  theme(legend.position="bottom") + scale_colour_grey() +
  guides(colour = guide_legend(nrow = 6))

The next plot limits the submissions to those with more than one data type.

# some really fun dplyr foo!
library(zoo)
select(gse_type,gse) %>% 
  group_by(gse) %>%
  summarize(datatypes=n()) %>%
  filter(datatypes>1) %>%
  left_join(gse_type_year) %>%
  transform(year=as.Date(as.yearmon(year))) %>%
  group_by(type,year) %>%
  summarize(submissions=n()) %>%
  ggplot(aes(x=year,y=submissions,group=type)) +
    geom_line() +
    scale_y_log10() + 
    theme(legend.position="bottom", 
            axis.text=element_text(size=rel(1.5)), 
            axis.title=element_text(size=rel(1.5)), 
            legend.text=element_text(size=rel(1.1))) +
    xlab("Year") + ylab("Number of Submissions") +
    geom_point(aes(shape=type),size=2.5) +
      guides(shape=guide_legend(nrow = 6)) 

sessionInfo()

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] zoo_1.7-11          ggplot2_1.0.1       lubridate_1.3.3    
##  [4] tidyr_0.2.0         dplyr_0.4.1         GEOmetadb_1.27.0   
##  [7] RSQLite_1.0.0       DBI_0.3.1           GEOquery_2.34.0    
## [10] Biobase_2.27.1      BiocGenerics_0.14.0 pander_0.5.1       
## [13] knitr_1.8          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6      formatR_1.0      plyr_1.8.1       tools_3.2.1     
##  [5] digest_0.6.8     lattice_0.20-31  evaluate_0.5.5   memoise_0.2.1   
##  [9] gtable_0.1.2     yaml_2.1.13      proto_0.3-10     stringr_0.6.2   
## [13] grid_3.2.1       R6_2.0.1         XML_3.98-1.1     rmarkdown_0.7   
## [17] reshape2_1.4     magrittr_1.5     scales_0.2.4     codetools_0.2-11
## [21] htmltools_0.2.6  MASS_7.3-40      assertthat_0.1   colorspace_1.2-4
## [25] labeling_0.3     RCurl_1.95-4.3   lazyeval_0.1.10  munsell_0.4.2