This is just a little exercise showing the power of dplyr for data munging in the context of a realistic biological relational database.
The first step is to load the GEOmetadb library which provides the very light infrastructure for the separate SQLite database. After loading the library, retrieve the SQLite database file–updated weekly–for local queries against the NCBI GEO metadata.
library(GEOmetadb)
sfile = 'GEOmetadb.sqlite'
if(!file.exists(sfile)) {
sfile = getSQLiteFile()
}
Just for fun, instead of using SQL queries, we can rely on dplyr to do the SQL for us.
library(dplyr)
gmdb = src_sqlite(sfile)
# List available tables in the database
src_tbls(gmdb)
## [1] "gds" "gds_subset" "geoConvert"
## [4] "geodb_column_desc" "gpl" "gse"
## [7] "gse_gpl" "gse_gsm" "gsm"
## [10] "metaInfo" "sMatrix"
tgse = tbl(gmdb,'gse')
tgsm = tbl(gmdb,'gsm')
tgpl = tbl(gmdb,'gpl')
For the purposes of this little exercise, we do three sets of queries:
tablestats = data.frame(entity=c('Series','Samples','Platforms'),
Count=c(nrow(tgse),nrow(tgsm),nrow(tgpl)))
pander(tablestats,justify=c('left','right'))
entity | Count |
---|---|
Series | 56976 |
Samples | 1370122 |
Platforms | 14291 |
One way of measuring multiplatform GEO Series in GEO is to map the GSE identifiers to their associated GPL identifiers. This probably overestimates the actual complexity of the data, as some GEO series map to multiple arrays of the same technology. For example, there is a GEO Series, GSE1097, that has 153 platforms, but this was a set of whole-genome tiling arrays.
tgse_gpl = tbl(gmdb,'gse_gpl')
gse_gpl_count = select(tgse_gpl,gse) %>%
group_by(gse) %>%
summarise(count=n()) %>%
filter(count>1) %>%
collect()
# and make a cut base on number of platforms associated with a GSE
cut_count = cut(gse_gpl_count$count,
breaks=c(2,3,5,10,20,max(gse_gpl_count$count)))
table(cut_count)
## cut_count
## (2,3] (3,5] (5,10] (10,20] (20,153]
## 1046 493 177 73 31
And a small barplot with the binned number of GPLs per GSE:
barplot(table(cut_count),
main="GEO Series by Number of GEO Platforms",
sub="Includes only GEO Series with more than one platform")
Another way to quantify “multiplatform” is to look at the different assay types in a GSE. This may underestimate the number somewhat, but it is probably closer to what we want to capture. The type
column in the GSE table includes a semicolon-separated list of experiment types. While perhaps not a perfect list, it at least allows us to quantify the number of experiments of each type. The number of each type
is given in the next table.
library(tidyr)
gse_type = select(tgse,gse,type) %>%
transform(type = strsplit(type,';\\t')) %>%
unnest(type)
type_count = select(gse_type,type) %>%
group_by(type) %>%
summarize(count=n()) %>%
arrange(desc(count))
pander(type_count,justify=c('left','right'))
type | count |
---|---|
Expression profiling by array | 39707 |
Expression profiling by high throughput sequencing | 4478 |
Genome binding/occupancy profiling by high throughput sequencing | 3798 |
Genome binding/occupancy profiling by genome tiling array | 2105 |
Non-coding RNA profiling by array | 2100 |
Non-coding RNA profiling by high throughput sequencing | 1423 |
Other | 1096 |
Genome variation profiling by genome tiling array | 1055 |
Genome variation profiling by SNP array | 801 |
Methylation profiling by high throughput sequencing | 730 |
Methylation profiling by genome tiling array | 678 |
Expression profiling by genome tiling array | 634 |
Genome variation profiling by array | 593 |
Methylation profiling by array | 542 |
SNP genotyping by SNP array | 502 |
Expression profiling by RT-PCR | 316 |
Expression profiling by SAGE | 242 |
Genome binding/occupancy profiling by array | 169 |
Protein profiling by protein array | 166 |
Third-party reanalysis | 130 |
Non-coding RNA profiling by genome tiling array | 106 |
Genome variation profiling by high throughput sequencing | 62 |
Expression profiling by MPSS | 20 |
Expression profiling by SNP array | 13 |
Genome binding/occupancy profiling by SNP array | 12 |
Methylation profiling by SNP array | 9 |
Protein profiling by Mass Spec | 6 |
Genome binding/occupancy profiling by high throughput sequen | 1 |
We can summarize the number of different types per GEO Series. The results are given in the next table.
gse_by_type_count = select(gse_type,gse) %>%
group_by(gse) %>%
summarize(NumberOfDataTypes=n()) %>%
group_by(NumberOfDataTypes) %>%
summarize(NumberOfGEOSeries=n())
pander(gse_by_type_count,justify=c('right','right'))
NumberOfDataTypes | NumberOfGEOSeries |
---|---|
1 | 53066 |
2 | 3382 |
3 | 456 |
4 | 64 |
5 | 8 |
I have summarized the data by profiling type and year of submission. To keep the data clean, I have included only the 6 most common GSE types for the plot. Feel free to adjust.
library(lubridate)
gse_type_year = select(tgse,gse,type,year=submission_date) %>%
transform(year=year(as.Date(year))) %>%
filter(year<2015) %>%
transform(type = strsplit(type,';\\t')) %>%
unnest(type) %>%
right_join(type_count[1:6,'type'])
The following plot shows submissions per year (not cumulative number of submissions) cut off at 2014. There are clearly trends toward sequencing over microarrays, but microarray expression profiling still greatly exceeds sequencing–note y-scale is log10.
library(ggplot2)
group_by(gse_type_year,type,year) %>%
summarize(submissions=n()) %>%
ggplot(aes(x=year,y=submissions,group=type,color=type)) +
geom_line(aes(linetype=type),size=1.3) +
scale_y_log10() +
theme(legend.position="bottom") + scale_colour_grey() +
guides(colour = guide_legend(nrow = 6))
The next plot limits the submissions to those with more than one data type.
# some really fun dplyr foo!
library(zoo)
select(gse_type,gse) %>%
group_by(gse) %>%
summarize(datatypes=n()) %>%
filter(datatypes>1) %>%
left_join(gse_type_year) %>%
transform(year=as.Date(as.yearmon(year))) %>%
group_by(type,year) %>%
summarize(submissions=n()) %>%
ggplot(aes(x=year,y=submissions,group=type)) +
geom_line() +
scale_y_log10() +
theme(legend.position="bottom",
axis.text=element_text(size=rel(1.5)),
axis.title=element_text(size=rel(1.5)),
legend.text=element_text(size=rel(1.1))) +
xlab("Year") + ylab("Number of Submissions") +
geom_point(aes(shape=type),size=2.5) +
guides(shape=guide_legend(nrow = 6))
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] zoo_1.7-11 ggplot2_1.0.1 lubridate_1.3.3
## [4] tidyr_0.2.0 dplyr_0.4.1 GEOmetadb_1.27.0
## [7] RSQLite_1.0.0 DBI_0.3.1 GEOquery_2.34.0
## [10] Biobase_2.27.1 BiocGenerics_0.14.0 pander_0.5.1
## [13] knitr_1.8
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.6 formatR_1.0 plyr_1.8.1 tools_3.2.1
## [5] digest_0.6.8 lattice_0.20-31 evaluate_0.5.5 memoise_0.2.1
## [9] gtable_0.1.2 yaml_2.1.13 proto_0.3-10 stringr_0.6.2
## [13] grid_3.2.1 R6_2.0.1 XML_3.98-1.1 rmarkdown_0.7
## [17] reshape2_1.4 magrittr_1.5 scales_0.2.4 codetools_0.2-11
## [21] htmltools_0.2.6 MASS_7.3-40 assertthat_0.1 colorspace_1.2-4
## [25] labeling_0.3 RCurl_1.95-4.3 lazyeval_0.1.10 munsell_0.4.2