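List BioMart databases: listMarts
- Look up host names at http://ensemblgenomes.org/info/data_access.
- listMarts(host = "www.ensembl.org") lists the BioMart databases available on that host.
- If the connection fails, choose a mirror such as asia.ensembl.org.
- Bacteria cannot be queried through BioMart.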
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 95
## 2   ENSEMBL_MART_MOUSE      Mouse strains 95
## 3     ENSEMBL_MART_SNP  Ensembl Variation 95
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 95
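The mirror advice above could be automated roughly as follows. This is only a sketch: `list_marts_with_fallback` and its `lister` argument are hypothetical names introduced here (they are not part of biomaRt), and the code simply assumes that a failed connection raises an R error.

```r
# Try each host in turn and return the first listMarts() result that succeeds.
# `lister` is a hypothetical injection point so the logic can be tried offline;
# when it is not supplied, biomaRt::listMarts is used.
list_marts_with_fallback <- function(hosts, lister = biomaRt::listMarts) {
  for (h in hosts) {
    # A connection failure is assumed to surface as an error; turn it into NULL
    res <- tryCatch(lister(host = h), error = function(e) NULL)
    if (!is.null(res)) {
      res$host_name <- rep(h, nrow(res))  # remember which host answered
      return(res)
    }
  }
  stop("could not connect to any of: ", paste(hosts, collapse = ", "))
}

# Usage (needs network access):
# marts <- list_marts_with_fallback(c("www.ensembl.org", "asia.ensembl.org"))
```

Listing the main site first and the mirror second keeps the usual behaviour and only falls back when the first host fails.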
library(biomaRt)
library(dplyr)
# Choose host names from the following
hosts <- c("www.ensembl.org", "plants.ensembl.org", "fungi.ensembl.org",
           "protists.ensembl.org", "metazoa.ensembl.org")
# Look up the BioMart databases available on each host
marts <- Reduce(rbind, lapply(seq_along(hosts), function(i) {
  biomaRt::listMarts(host = hosts[i]) %>%
    mutate(host_name = rep(hosts[i], nrow(.)))
}))
knitr::kable(marts, format = "pandoc", align = "l", caption = "listMarts")
listMarts

| biomart | version | host_name |
|:--|:--|:--|
| ENSEMBL_MART_ENSEMBL | Ensembl Genes 95 | www.ensembl.org |
| ENSEMBL_MART_MOUSE | Mouse strains 95 | www.ensembl.org |
| ENSEMBL_MART_SNP | Ensembl Variation 95 | www.ensembl.org |
| ENSEMBL_MART_FUNCGEN | Ensembl Regulation 95 | www.ensembl.org |
| plants_mart | Ensembl Plants Genes 42 | plants.ensembl.org |
| plants_variations | Ensembl Plants Variations 42 | plants.ensembl.org |
| fungi_mart | Ensembl Fungi Genes 42 | fungi.ensembl.org |
| fungi_variations | Ensembl Fungi Variations 42 | fungi.ensembl.org |
| protists_mart | Ensembl Protists Genes 42 | protists.ensembl.org |
| protists_variations | Ensembl Protists Variations 42 | protists.ensembl.org |
| metazoa_mart | Ensembl Metazoa Genes 42 | metazoa.ensembl.org |
| metazoa_variations | Ensembl Metazoa Variations 42 | metazoa.ensembl.org |
List the datasets available in a BioMart database: listDatasets
- First select a BioMart database with useMart, then call listDatasets on the selected BioMart.
- Build the dataset list for several hosts and marts in advance.
marts <- c("ENSEMBL_MART_ENSEMBL", "plants_mart", "fungi_mart", "protists_mart", "metazoa_mart")
hosts <- c("www.ensembl.org", "plants.ensembl.org", "fungi.ensembl.org",
           "protists.ensembl.org", "metazoa.ensembl.org")
# One listDatasets() result per (mart, host) pair
l_dsets <- mapply(function(x, y) {
  listDatasets(useMart(biomart = x, host = y), verbose = FALSE)
}, x = marts, y = hosts, SIMPLIFY = FALSE)
# Tag each table with its mart and host, then bind everything into one data frame
dsets <- lapply(seq_along(l_dsets), function(i) {
  l_dsets[[i]] %>%
    mutate(biomart = rep(marts[i], nrow(.)), host_name = rep(hosts[i], nrow(.))) %>%
    dplyr::select(biomart, host_name, dataset, description, version)
}) %>%
  {Reduce(rbind, .)}
knitr::kable(head(dsets), format = "pandoc", align = "l", caption = "DataSet")
DataSet

| biomart | host_name | dataset | description | version |
|:--|:--|:--|:--|:--|
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | acalliptera_gene_ensembl | Eastern happy genes (fAstCal1.2) | fAstCal1.2 |
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | acarolinensis_gene_ensembl | Anole lizard genes (AnoCar2.0) | AnoCar2.0 |
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | acitrinellus_gene_ensembl | Midas cichlid genes (Midas_v5) | Midas_v5 |
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | amelanoleuca_gene_ensembl | Panda genes (ailMel1) | ailMel1 |
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | amexicanus_gene_ensembl | Cave fish genes (Astyanax_mexicanus-2.0) | Astyanax_mexicanus-2.0 |
| ENSEMBL_MART_ENSEMBL | www.ensembl.org | anancymaae_gene_ensembl | Ma’s night monkey genes (Anan_2.0) | Anan_2.0 |
Get the mart for a given species from the dataset list: useMart
- useMart connects to the specified BioMart database and to a dataset within it.
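One way to go from the dataset table to a connected mart is sketched below. `connect_by_description` and its `connector` argument are hypothetical helpers introduced for illustration, and the table is assumed to have the columns biomart, host_name, dataset, description and version. How the `cg_mart` object used later was actually created is not shown in the text, so this is not a reconstruction of that step.

```r
# Look up exactly one row of a dataset table and connect to it.
# `connector` is a hypothetical injection point so the lookup can be tried
# offline; when it is not supplied, biomaRt::useMart is used.
connect_by_description <- function(dsets, pattern, connector = biomaRt::useMart) {
  hit <- dsets[grepl(pattern, dsets$description, ignore.case = TRUE), , drop = FALSE]
  if (nrow(hit) != 1) {
    stop("pattern matched ", nrow(hit), " datasets; refine it to match exactly one")
  }
  # as.character() guards against factor or AsIs columns
  connector(biomart = as.character(hit$biomart),
            dataset = as.character(hit$dataset),
            host    = as.character(hit$host_name))
}

# Usage (needs network access; the pattern is only an example):
# panda_mart <- connect_by_description(dsets, "^Panda genes")
```

Refusing to connect when the pattern matches zero or several rows avoids silently picking the wrong species.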
Retrieve data from the database
List the attributes available in the selected dataset: listAttributes
Attributes of mart

| name | description | page |
|:--|:--|:--|
| ensembl_gene_id | Gene stable ID | feature_page |
| ensembl_gene_id_version | Gene stable ID version | feature_page |
| ensembl_transcript_id | Transcript stable ID | feature_page |
| ensembl_transcript_id_version | Transcript stable ID version | feature_page |
| ensembl_peptide_id | Protein stable ID | feature_page |
| ensembl_peptide_id_version | Protein stable ID version | feature_page |
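The attribute table is long, so it helps to search it before choosing attributes for a query. The helper below is a minimal sketch; `find_attributes` is a hypothetical name, and it assumes only that the table has `name` and `description` columns, as in the excerpt above.

```r
# Return the rows of an attribute table whose name or description
# matches `pattern` (case-insensitive regular expression).
find_attributes <- function(attrs, pattern) {
  keep <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[keep, , drop = FALSE]
}

# Usage (needs network access and a connected mart such as cg_mart):
# attrs <- biomaRt::listAttributes(cg_mart)
# find_attributes(attrs, "transcript")
```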
Retrieve the specified data from the database: getBM
- The attribute set atr_2 below is laid out to resemble the BED format (in actual BED, the fifth column is 'score').
# selected attributes
atr_1 <- c("chromosome_name", "ensembl_transcript_id", "ensembl_peptide_id",
           "external_gene_name", "description")
atr_2 <- c("chromosome_name", "start_position", "end_position",
           "external_gene_name", "ensembl_transcript_id", "strand",
           "cds_start", "cds_end", "description")
# getBM without filters: fetch atr_1 for the whole dataset
dat1_cg_mart <- getBM(attributes = atr_1, mart = cg_mart)
# getBM with filters: drop entries that are not protein coding
# (fetching everything takes a long time, so only one scaffold is tried here)
enst <- dat1_cg_mart %>%
  filter(chromosome_name == "JH000064.1", ensembl_peptide_id != "") %>%
  pull(ensembl_transcript_id)
dat2_cg_mart <- getBM(attributes = atr_2, filters = "ensembl_transcript_id",
                      values = enst, mart = cg_mart)
# Show the first rows of each result
knitr::kable(head(dat1_cg_mart), format = "pandoc", align = "l", caption = "Attributes of mart(id retrieve)")
knitr::kable(head(dat2_cg_mart), format = "pandoc", align = "l", caption = "Attributes of mart(bed like)")
Attributes of mart(id retrieve)

| chromosome_name | ensembl_transcript_id | ensembl_peptide_id | external_gene_name | description |
|:--|:--|:--|:--|:--|
| MT | ENSCGRT00000000001 | | | |
| MT | ENSCGRT00000000002 | | | |
| MT | ENSCGRT00000000003 | | | |
| MT | ENSCGRT00000000004 | | | |
| MT | ENSCGRT00000000005 | | | |
| MT | ENSCGRT00000000006 | ENSCGRP00000000001 | ND1 | NADH dehydrogenase subunit 1 [Source:NCBI gene;Acc:3979183] |
Attributes of mart(bed like)

| chromosome_name | start_position | end_position | external_gene_name | ensembl_transcript_id | strand | cds_start | cds_end | description |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 1 | 67 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 68 | 181 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 182 | 277 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 278 | 349 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 350 | 421 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
| JH000064.1 | 2214310 | 2216976 | | ENSCGRT00000009188 | 1 | 422 | 508 | C-type lectin domain family 10 member A-like [Source:NCBI gene;Acc:100768594] |
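Since the result only resembles BED, turning it into actual BED6 columns takes a small conversion. The following is a sketch under stated assumptions: `to_bed6` is a hypothetical helper, the score is filled with a placeholder 0, and the transcript ID is used as the name when the gene name is empty. Ensembl coordinates are 1-based and inclusive while BED starts are 0-based, so the start is shifted by one. Note that start_position and end_position are gene-level coordinates.

```r
# Convert a getBM() result holding the atr_2 columns into BED6 columns.
to_bed6 <- function(dat) {
  gene <- dat$external_gene_name
  # Fall back to the transcript ID when no gene name is available
  name <- ifelse(is.na(gene) | gene == "", dat$ensembl_transcript_id, gene)
  bed <- data.frame(
    chrom      = dat$chromosome_name,
    chromStart = dat$start_position - 1L,  # 1-based inclusive -> 0-based start
    chromEnd   = dat$end_position,
    name       = name,
    score      = 0L,                       # placeholder, no real score available
    strand     = ifelse(dat$strand == 1, "+", "-"),
    stringsAsFactors = FALSE
  )
  # cds_start/cds_end repeat the same interval once per CDS segment; keep one row
  unique(bed)
}

# Usage (dat2_cg_mart comes from the getBM call above):
# bed <- to_bed6(dat2_cg_mart)
# write.table(bed, "out.bed", sep = "\t", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```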
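Retrieve sequences: getSequence
- Apparently this works only with "ENSEMBL_MART_ENSEMBL".
- Choose seqType from the following: 'cdna', 'peptide', '3utr', '5utr', 'gene_exon', 'transcript_exon_intron', 'gene_exon_intron', 'coding', 'coding_transcript_flank', 'transcript_flank', 'gene_flank'.
- Note that the order of the IDs in the result differs from the order in which they were given.
- type is specified as one of the filter types reported by listFilters.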
# Give start and end, and fetch the seqType sequences that lie within that range
seq_1 <- getSequence(chromosome = "JH000064.1", start = 2214310, end = 2216976,
                     type = "ensembl_transcript_id", seqType = "cdna", mart = cg_mart)
# Give IDs instead
flt <- c("ENSCGRT00000009189", "ENSCGRT00000009188", "ENSCGRT00000009190")
seq_2 <- getSequence(id = flt, type = "ensembl_transcript_id", seqType = "coding", mart = cg_mart)
seq_3 <- getSequence(id = flt, type = "ensembl_transcript_id", seqType = "peptide", mart = cg_mart)
# Keep only the first 50 characters of each sequence for display
suppressPackageStartupMessages(library(dplyr))
seq_1_tab <- seq_1 %>% mutate(cdna = substr(cdna, 1, 50))
seq_2_tab <- seq_2 %>% mutate(coding = substr(coding, 1, 50))
seq_3_tab <- seq_3 %>% mutate(peptide = substr(peptide, 1, 50))
Sequences are returned as data frames.

A

| cdna | ensembl_transcript_id |
|:--|:--|
| ATGACAATTACATACGAAAACTTCCAGAACTCAGGAATCGAGGAGAAAAA | ENSCGRT00000009188 |
| ATTACATACGAAAACTTCCAGAACTCAGGAATCGAGGAGAAAAACCCAGA | ENSCGRT00000009189 |
| TCTCTGGAGAGCACAGTGGAGAAAAAGGAACAGCAATTCAAAACAGGTCT | ENSCGRT00000009190 |

B

| coding | ensembl_transcript_id |
|:--|:--|
| ATTACATACGAAAACTTCCAGAACTCAGGAATCGAGGAGAAAAACCCAGA | ENSCGRT00000009189 |
| ATGACAATTACATACGAAAACTTCCAGAACTCAGGAATCGAGGAGAAAAA | ENSCGRT00000009188 |
| TCTCTGGAGAGCACAGTGGAGAAAAAGGAACAGCAATTCAAAACAGGTCT | ENSCGRT00000009190 |

C

| peptide | ensembl_transcript_id |
|:--|:--|
| MTITYENFQNSGIEEKNPEIGKAAPPKSFLWDIFSWTRLLLFSLGLGLLL | ENSCGRT00000009188 |
| ITYENFQNSGIEEKNPEIGKAAPPKSFLWDIFSWTRLLLFSLGLGLLLLV | ENSCGRT00000009189 |
| SLESTVEKKEQQFKTGLSEITERVQELGKDLKALSCQLASLKNNGSAMAC | ENSCGRT00000009190 |
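Because the rows do not necessarily come back in the order of the query IDs, it is safer to realign the result before combining it with other tables. A minimal sketch, where `reorder_by_id` is a hypothetical helper and the ID column is assumed to be named after the `type` argument:

```r
# Put the rows of a getSequence() result back into the order of the query IDs.
# IDs that are absent from the result are dropped.
reorder_by_id <- function(seq_df, ids, id_col = "ensembl_transcript_id") {
  idx <- match(ids, seq_df[[id_col]])
  seq_df[idx[!is.na(idx)], , drop = FALSE]
}

# Usage (seq_3 and flt come from the code above):
# seq_3_ordered <- reorder_by_id(seq_3, flt)
```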
Environment
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 dplyr_0.7.8 biomaRt_2.36.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 highr_0.7 pillar_1.3.1
## [4] compiler_3.5.1 bindr_0.1.1 prettyunits_1.0.2
## [7] bitops_1.0-6 tools_3.5.1 progress_1.2.0
## [10] digest_0.6.18 bit_1.1-14 tibble_2.0.0
## [13] RSQLite_2.1.1 evaluate_0.12 memoise_1.1.0
## [16] pkgconfig_2.0.2 rlang_0.3.1 DBI_1.0.0
## [19] curl_3.2 yaml_2.2.0 parallel_3.5.1
## [22] stringr_1.3.1 httr_1.3.1 knitr_1.20
## [25] S4Vectors_0.18.3 IRanges_2.14.12 hms_0.4.2
## [28] tidyselect_0.2.5 stats4_3.5.1 rprojroot_1.3-2
## [31] bit64_0.9-7 glue_1.3.0 Biobase_2.40.0
## [34] R6_2.3.0 AnnotationDbi_1.42.1 XML_3.98-1.16
## [37] rmarkdown_1.10 purrr_0.2.5 blob_1.1.1
## [40] magrittr_1.5 backports_1.1.2 htmltools_0.3.6
## [43] BiocGenerics_0.26.0 assertthat_0.2.0 stringi_1.2.4
## [46] RCurl_1.95-4.11 crayon_1.3.4