NCBI GEO Metadata Survey

Introduction

This is just a little exercise showing the power of dplyr for data munging in the context of a realistic biological relational database.

Get metadata sqlite file

The first step is to load the GEOmetadb library which provides the very light infrastructure for the separate SQLite database. After loading the library, retrieve the SQLite database file–updated weekly–for local queries against the NCBI GEO metadata.

library(GEOmetadb)
sfile = 'GEOmetadb.sqlite'
if(!file.exists(sfile)) {
  sfile = getSQLiteFile() 
}

Set up dplyr stuff….

Just for fun, instead of using SQL queries, we can rely on dplyr to do the SQL for us.

library(dplyr)
gmdb = src_sqlite(sfile)
# List available tables in the database
src_tbls(gmdb)

##  [1] "gds"               "gds_subset"        "geoConvert"       
##  [4] "geodb_column_desc" "gpl"               "gse"              
##  [7] "gse_gpl"           "gse_gsm"           "gsm"              
## [10] "metaInfo"          "sMatrix"

tgse = tbl(gmdb,'gse')
tgsm = tbl(gmdb,'gsm')
tgpl = tbl(gmdb,'gpl')

Queries of interest

For the purposes of this little exercise, we do three sets of queries:

A count of the rows in the three main “data” tables
A quick survey of multiplatform GEO Series
A plot of trends over time of the most common experiment types

Simple table summaries

tablestats = data.frame(entity=c('Series','Samples','Platforms'),
                        Count=c(nrow(tgse),nrow(tgsm),nrow(tgpl)))
pander(tablestats,justify=c('left','right'))

entity	Count
Series	56976
Samples	1370122
Platforms	14291

Multiplatform datasets by gpl

One way of measuring multiplatform GEO Series in GEO is to map the GSE identifiers to their associated GPL identifiers. This probably overestimates the actual complexity of the data, as some GEO series map to multiple arrays of the same technology. For example, there is a GEO Series, GSE1097, that has 153 platforms, but this was a set of whole-genome tiling arrays.

tgse_gpl = tbl(gmdb,'gse_gpl')
gse_gpl_count = select(tgse_gpl,gse) %>% 
  group_by(gse) %>%
  summarise(count=n()) %>%
  filter(count>1) %>%
  collect()
# and make a cut base on number of platforms associated with a GSE
cut_count = cut(gse_gpl_count$count,
                breaks=c(2,3,5,10,20,max(gse_gpl_count$count)))
table(cut_count)

## cut_count
##    (2,3]    (3,5]   (5,10]  (10,20] (20,153] 
##     1046      493      177       73       31

And a small barplot with the binned number of GPLs per GSE:

barplot(table(cut_count),
        main="GEO Series by Number of GEO Platforms",
        sub="Includes only GEO Series with more than one platform")

Multiplatform based on assay type

Another way to quantify “multiplatform” is to look at the different assay types in a GSE. This may underestimate the number somewhat, but it is probably closer to what we want to capture. The type column in the GSE table includes a semicolon-separated list of experiment types. While perhaps not a perfect list, it at least allows us to quantify the number of experiments of each type. The number of each type is given in the next table.

library(tidyr)
gse_type = select(tgse,gse,type) %>%
  transform(type = strsplit(type,';\\t')) %>%
  unnest(type) 
type_count = select(gse_type,type) %>%
  group_by(type) %>%
  summarize(count=n()) %>% 
  arrange(desc(count))
pander(type_count,justify=c('left','right'))

type	count
Expression profiling by array	39707
Expression profiling by high throughput sequencing	4478
Genome binding/occupancy profiling by high throughput sequencing	3798
Genome binding/occupancy profiling by genome tiling array	2105
Non-coding RNA profiling by array	2100
Non-coding RNA profiling by high throughput sequencing	1423
Other	1096
Genome variation profiling by genome tiling array	1055
Genome variation profiling by SNP array	801
Methylation profiling by high throughput sequencing	730
Methylation profiling by genome tiling array	678
Expression profiling by genome tiling array	634
Genome variation profiling by array	593
Methylation profiling by array	542
SNP genotyping by SNP array	502
Expression profiling by RT-PCR	316
Expression profiling by SAGE	242
Genome binding/occupancy profiling by array	169
Protein profiling by protein array	166
Third-party reanalysis	130
Non-coding RNA profiling by genome tiling array	106
Genome variation profiling by high throughput sequencing	62
Expression profiling by MPSS	20
Expression profiling by SNP array	13
Genome binding/occupancy profiling by SNP array	12
Methylation profiling by SNP array	9
Protein profiling by Mass Spec	6
Genome binding/occupancy profiling by high throughput sequen	1

We can summarize the number of different types per GEO Series. The results are given in the next table.

gse_by_type_count = select(gse_type,gse) %>%
  group_by(gse) %>%
  summarize(NumberOfDataTypes=n()) %>%
  group_by(NumberOfDataTypes) %>% 
  summarize(NumberOfGEOSeries=n())
pander(gse_by_type_count,justify=c('right','right'))

NumberOfDataTypes	NumberOfGEOSeries
1	53066
2	3382
3	456
4	64
5	8

Data growth over time

I have summarized the data by profiling type and year of submission. To keep the data clean, I have included only the 6 most common GSE types for the plot. Feel free to adjust.

library(lubridate)
gse_type_year = select(tgse,gse,type,year=submission_date) %>%
  transform(year=year(as.Date(year))) %>%
  filter(year<2015) %>%
  transform(type = strsplit(type,';\\t')) %>%
  unnest(type) %>% 
  right_join(type_count[1:6,'type'])

The following plot shows submissions per year (not cumulative number of submissions) cut off at 2014. There are clearly trends toward sequencing over microarrays, but microarray expression profiling still greatly exceeds sequencing–note y-scale is log10.

library(ggplot2)
group_by(gse_type_year,type,year) %>%
  summarize(submissions=n()) %>%
  ggplot(aes(x=year,y=submissions,group=type,color=type)) +
  geom_line(aes(linetype=type),size=1.3) + 
  scale_y_log10() + 
  theme(legend.position="bottom") + scale_colour_grey() +
  guides(colour = guide_legend(nrow = 6))

The next plot limits the submissions to those with more than one data type.

# some really fun dplyr foo!
library(zoo)
select(gse_type,gse) %>% 
  group_by(gse) %>%
  summarize(datatypes=n()) %>%
  filter(datatypes>1) %>%
  left_join(gse_type_year) %>%
  transform(year=as.Date(as.yearmon(year))) %>%
  group_by(type,year) %>%
  summarize(submissions=n()) %>%
  ggplot(aes(x=year,y=submissions,group=type)) +
    geom_line() +
    scale_y_log10() + 
    theme(legend.position="bottom", 
            axis.text=element_text(size=rel(1.5)), 
            axis.title=element_text(size=rel(1.5)), 
            legend.text=element_text(size=rel(1.1))) +
    xlab("Year") + ylab("Number of Submissions") +
    geom_point(aes(shape=type),size=2.5) +
      guides(shape=guide_legend(nrow = 6))

sessionInfo()

sessionInfo()

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] zoo_1.7-11          ggplot2_1.0.1       lubridate_1.3.3    
##  [4] tidyr_0.2.0         dplyr_0.4.1         GEOmetadb_1.27.0   
##  [7] RSQLite_1.0.0       DBI_0.3.1           GEOquery_2.34.0    
## [10] Biobase_2.27.1      BiocGenerics_0.14.0 pander_0.5.1       
## [13] knitr_1.8          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6      formatR_1.0      plyr_1.8.1       tools_3.2.1     
##  [5] digest_0.6.8     lattice_0.20-31  evaluate_0.5.5   memoise_0.2.1   
##  [9] gtable_0.1.2     yaml_2.1.13      proto_0.3-10     stringr_0.6.2   
## [13] grid_3.2.1       R6_2.0.1         XML_3.98-1.1     rmarkdown_0.7   
## [17] reshape2_1.4     magrittr_1.5     scales_0.2.4     codetools_0.2-11
## [21] htmltools_0.2.6  MASS_7.3-40      assertthat_0.1   colorspace_1.2-4
## [25] labeling_0.3     RCurl_1.95-4.3   lazyeval_0.1.10  munsell_0.4.2