Bioconductor 2020 Annotation Workshop

Introduction to Bioconductor Annotation Resources

Bioc 2020: 29-31 July

Instructors

Workshop Description

There are various annotation packages provided by the Bioconductor project that can be used to incorporate additional information to results from high-throughput experiments. This can be as simple as mapping Ensembl IDs to corresponding HUGO gene symbols, to much more complex queries involving multiple data sources. In this workshop we will cover the various classes of annotation packages, what they contain, and how to use them efficiently.

ハイスループット実験の結果に追加情報を組み込むために使用できる、 Bioconductorプロジェクトによって提供されるさまざまな アノテーションパッケージがあります。 これは、複数のデータソース含まれる非常に複雑なクエリを、 Ensembl IDを対応するHUGO遺伝子シンボルにマッピングするのと同じくらい簡単にします。 このワークショップでは、アノテーションパッケージのさまざまなクラス、 それらに含まれるもの、およびそれらを効率的に使用する方法について説明します。

Prerequisites

  • Basic knowledge of R syntax
  • Basic understanding of the various annotation sources (NCBI, EBI/EMBL)

Useful background reading

Workshop Participation

After each type of annotation package is introduced, students will be given the opportunity to practice making their own queries.

R / Bioconductor packages used

  • AnnotationDbi
  • AnnotationHub
  • BSgenome
  • biomaRt
  • ensembldb
  • org.Hs.eg.db
  • TxDb.Hsapiens.UCSC.hg19.knownGene
  • EnsDb.Hsapiens.v79
  • EnsDb.Mmusculus.v79
  • Homo.sapiens
  • BSgenome.Hsapiens.UCSC.hg19
  • hugene20sttranscriptcluster.db

Workshop goals and objectives

Annotating data is a complex task. For any high-throughput experiment the analyst usually starts with a set of identifiers for each thing that was measured, and in order to make the results useful to collaborators these identifiers need to be mapped to other identifiers that are either more familiar to collaborators, or that can be used for further analyses. As an example, RNA-Seq data may only have Entrez Gene IDs for each gene measured, and as part of the output you may want to include the gene symbols, which are more likely to be familiar to a Biologist.

データをアノテートすることは複雑なタスクです。 ハイスループット実験の場合、アナリストは通常、測定された各対象物のIDのセットからスタートします。 結果をコラボレーターにとって有用なものにするために、 これらのIDは、コラボレーターにとってより馴染みのある他のIDにマップする必要があり、 またそれがさらなる分析に使われ得ます。例えば、RNA-Seqデータは、測定された各遺伝子の Entrez Gene IDしか持たない場合があり、出力の一部としてその遺伝子シンボルを 含めたい場合があるでしょう。それらは生物学者に(訳注:IDよりも) よく知られている可能性が高いです。

Learning goals

  • Understand what sort of annotation data are available : 利用可能なアノテーションデータの種類を理解する
  • Understand the difference between annotation sources (NCBI and EBI/EMBL) : アノテーションソースの違い(NCBI と EBI/EMBL)を理解する
  • Gain familiarity with the various ways to query annotation packages : アノテーションパッケージをクエリするためのさまざまな方法に慣れる
  • Get some practice making queries : クエリを作る練習をする

Learning objectives

  • Be able to use select and mapIds to map between identifiers : ID間をマップするためにselectとmapIdsを使うことができる
  • Be able to extract data from TxDb and EnsDb packages : TxDbとEnsDbパッケージからデータを抽出できる
  • Be able to make queries using biomaRt : biomaRtを使用してクエリを作成できる
  • Extract and utilize various data from AnnotationHub : AnnotationHubからさまざまなデータを抽出し、利用する

Annotation Workshop

What do we mean by annotation? アノテーションとは何を意味するか?

Map a known ID to other functional or positional information

既知のIDを他の機能または位置情報にマップする

Specific goal 具体的な目標

We have data and statistics, and we want to add other useful information

我々は何らかのデータと統計量を持っており、他の有用な情報を追加したいとします。

The end result might be as simple as a data.frame or HTML table, or as complex as a RangedSummarizedExperiment

最終結果は、data.frameまたはHTMLテーブルのように単純な場合もあれば、 RangedSummarizedExperiment のように複雑な場合もあります。

Data containers

ExpressionSet

load(file.path(system.file(package = "Bioc2020Anno", "extdata"), "eset.Rdata"))
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 33552 features, 6 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2194079 GSM2194080 ... GSM2194084 (6 total)
##   varLabels: title characteristics_ch1.1
##   varMetadata: labelDescription
## featureData
##   featureNames: 16657436 16657440 ... 17118478 (33552 total)
##   fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: pd.hugene.2.0.st

ExpressionSet (continued)

head(exprs(eset))
##          GSM2194079 GSM2194080 GSM2194081 GSM2194082 GSM2194083 GSM2194084
## 16657436   8.505158   9.046577   8.382674   9.115481   8.715343   8.566301
## 16657440   7.948860   8.191222   7.901911   8.459781   8.191793   8.219658
## 16657450  10.932934  11.228553  10.948120  11.462231  11.300046  11.300886
## 16657469   9.172462   9.344630   9.193450   9.465584   9.464020   9.135715
## 16657473   6.222049   6.551035   6.000246   6.398798   5.892654   5.592125
## 16657476   8.514300   8.474073   8.407196   8.811238   8.780833   8.874606
head(pData(phenoData(eset)))
##                          title characteristics_ch1.1
## GSM2194079   SW620-miR625-rep1     shRNA: miR-625-3p
## GSM2194080   SW620-miR625-rep2     shRNA: miR-625-3p
## GSM2194081   SW620-miR625-rep3     shRNA: miR-625-3p
## GSM2194082 SW620-scramble-rep1       shRNA: scramble
## GSM2194083 SW620-scramble-rep2       shRNA: scramble
## GSM2194084 SW620-scramble-rep3       shRNA: scramble

ExpressionSet (continued)

head(pData(featureData(eset)))
##           PROBEID  ENTREZID      SYMBOL
## 16657436 16657436     84771     DDX11L2
## 16657440 16657440 100302278   MIR1302-2
## 16657450 16657450    402483   LINC01000
## 16657469 16657469    140849 LINC00266-1
## 16657473 16657473    729759      OR4F29
## 16657476 16657476    388574   RPL23AP87
##                                                   GENENAME
## 16657436                     DEAD/H-box helicase 11 like 2
## 16657440                                   microRNA 1302-2
## 16657450       long intergenic non-protein coding RNA 1000
## 16657469      long intergenic non-protein coding RNA 266-1
## 16657473 olfactory receptor family 4 subfamily F member 29
## 16657476              ribosomal protein L23a pseudogene 87

BioC containers vs basic structures

Pros

  • Validity checking

  • Subsetting

  • Function dispatch

  • Automatic behaviors

Cons

  • Difficult to create : 作るのが難しい

  • Cumbersome to extract data by hand : 手作業でデータを抽出するのは面倒

  • Useful only within R : R内でのみ有用

Annotation sources

Package type Example
ChipDb hugene20sttranscriptcluster.db
OrgDb org.Hs.eg.db
TxDb/EnsDb TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75
OrganismDb Homo.sapiens
BSgenome BSgenome.Hsapiens.UCSC.hg19
Others GO.db; KEGG.db
AnnotationHub Online resource
biomaRt Online resource

Interacting with AnnoDb packages

The main function is select:

主な機能は select です:

select(annopkg, keys, columns, keytype)

Where ここで

Simple example 簡単な例

Say we have analyzed data from an Affymetrix Human Gene ST 2.0 array and want to know what the genes are. For purposes of this lab, we just select some IDs at random.

Affymetrix Human Gene ST 2.0アレイ由来のデータを分析し、遺伝子が何かを知りたいとします。 この実習では、ランダムにいくつかのIDだけをセレクトしています。

library(hugene20sttranscriptcluster.db)
set.seed(12345)
ids <- featureNames(eset)[sample(1:25000, 5)]
ids
## [1] "16858710" "17001289" "16894087" "16813565" "16951883"
select(hugene20sttranscriptcluster.db, ids, "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##    PROBEID      SYMBOL
## 1 16858710        JUNB
## 2 17001289 PPP2R2B-IT1
## 3 16894087   LINC01247
## 4 16813565   NR2F2-AS1
## 5 16951883     OSBPL10

Questions!

How do you know what the central keys are? : どうしてセントラルキーが何であるかを知っているのですか?

More questions!

What keytypes or columns are available for a given annotation package?

あるアノテーションパッケージで使用できるkeytypeまたはcolumnは何かわかりますか?

keytypes(hugene20sttranscriptcluster.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIGENE"      "UNIPROT"
columns(hugene20sttranscriptcluster.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROBEID"      "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIGENE"      "UNIPROT"

Another example

There is one issue with select however.

ただし、select には1つ問題があります。

ids <- c('16737401','16657436' ,'16678303')
select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","MAP"))
## 'select()' returned 1:many mapping between keys and columns
##     PROBEID       SYMBOL     MAP
## 1  16737401        TRAF6   11p12
## 2  16657436      DDX11L1 1p36.33
## 3  16657436 LOC102725121 1p36.33
## 4  16657436      DDX11L2  2q14.1
## 5  16657436      DDX11L9 15q26.3
## 6  16657436     DDX11L10 16p13.3
## 7  16657436      DDX11L5  9p24.3
## 8  16657436     DDX11L16    Xq28
## 9  16657436     DDX11L16    Yq12
## 10 16678303         ARF1 1q42.13

The mapIds function

An alternative to select is mapIds, which gives control of duplicates

select に代わる手段は mapIds です。mapIdsは重複をコントロールします。

mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID")
## 'select()' returned 1:many mapping between keys and columns
##  16737401  16657436  16678303 
##   "TRAF6" "DDX11L1"    "ARF1"

Choices for multiVals (multiValsの選択肢)

Default is first, where we just choose the first of the duplicates. Other choices are list, CharacterList, filter, asNA or a user-specified function.

デフォルトはfirstです。これは最初の重複だけを選択します。 その他の選択肢は、list, CharacterList, filter, asNAまたはユーザー指定の関数です。

mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list")
## 'select()' returned 1:many mapping between keys and columns
## $`16737401`
## [1] "TRAF6"
## 
## $`16657436`
## [1] "DDX11L1"      "LOC102725121" "DDX11L2"      "DDX11L9"      "DDX11L10"    
## [6] "DDX11L5"      "DDX11L16"    
## 
## $`16678303`
## [1] "ARF1"

Choices for multiVals (continued)

mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList")
## 'select()' returned 1:many mapping between keys and columns
## CharacterList of length 3
## [["16737401"]] TRAF6
## [["16657436"]] DDX11L1 LOC102725121 DDX11L2 DDX11L9 DDX11L10 DDX11L5 DDX11L16
## [["16678303"]] ARF1
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter")
## 'select()' returned 1:many mapping between keys and columns
## 16737401 16678303 
##  "TRAF6"   "ARF1"
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "asNA")
## 'select()' returned 1:many mapping between keys and columns
## 16737401 16657436 16678303 
##  "TRAF6"       NA   "ARF1"

ChipDb/OrgDb questions

Using either the hugene20sttranscriptcluster.db or org.Hs.eg.db package,

TxDb packages

TxDb packages contain positional information; the contents can be inferred by the package name

TxDbパッケージには位置情報が含まれています; その内容は パッケージ名から推測できます

TxDb.Species.Source.Build.Table

TxDb.Dmelanogaster.UCSC.dm3.ensGene TxDb.Athaliana.BioMart.plantsmart22

EnsDb packages

EnsDb packages are similar to TxDb packages, but based on Ensembl mappings

EnsDb パッケージは TxDb パッケージに似ていますが、Ensemblのマッピング 情報に基づいています

Transcript packages

As with ChipDb and OrgDb packages, select and mapIds can be used to make queries

ChipDbおよびOrgDbパッケージと同様に、クエリするのに selectmapIds が使えます。

select(TxDb.Hsapiens.UCSC.hg19.knownGene, c("1","10"),
       c("TXNAME","TXCHROM","TXSTART","TXEND"), "GENEID")
## 'select()' returned 1:many mapping between keys and columns
##   GENEID     TXNAME TXCHROM  TXSTART    TXEND
## 1      1 uc002qsd.4   chr19 58858172 58864865
## 2      1 uc002qsf.2   chr19 58859832 58874214
## 3     10 uc003wyw.1    chr8 18248755 18258723
select(EnsDb.Hsapiens.v79, c("1", "10"),
       c("GENEID","GENENAME","SEQNAME","GENESEQSTART","GENESEQEND"), "ENTREZID")
##   ENTREZID          GENEID GENENAME SEQNAME GENESEQSTART GENESEQEND
## 1        1 ENSG00000121410     A1BG      19     58345178   58353499
## 2       10 ENSG00000156006     NAT2       8     18391245   18401218

But this is not how one normally uses them…

しかし、これは通常の使用方法ではありません…

GRanges

The normal use case for transcript packages is to extract positional information into a GRanges or GRangesList object. An example is the genomic position of all genes:

transcriptパッケージの通常の使用例は、位置情報を GRanges もしくは GRangesList オブジェクトに抽出することです。 一例は、すべての遺伝子のゲノムでの位置です:

gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
gns
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames              ranges strand |     gene_id
##            <Rle>           <IRanges>  <Rle> | <character>
##       1    chr19   58858172-58874214      - |           1
##      10     chr8   18248755-18258723      + |          10
##     100    chr20   43248163-43280376      - |         100
##    1000    chr18   25530930-25757445      - |        1000
##   10000     chr1 243651535-244006886      - |       10000
##     ...      ...                 ...    ... .         ...
##    9991     chr9 114979995-115095944      - |        9991
##    9992    chr21   35736323-35743440      + |        9992
##    9993    chr22   19023795-19109967      - |        9993
##    9994     chr6   90539619-90584155      + |        9994
##    9997    chr22   50961997-50964905      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

GRangesList

Or the genomic position of all transcripts by gene:

または、遺伝子ごとのすべての転写産物のゲノム位置:

txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs
## GRangesList object of length 23459:
## $`1`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58858172-58864865      - |     70455  uc002qsd.4
##   [2]    chr19 58859832-58874214      - |     70456  uc002qsf.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`10`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr8 18248755-18258723      + |     31944  uc003wyw.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`100`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr20 43248163-43280376      - |     72132  uc002xmj.3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## ...
## <23456 more elements>

Other accessors

Why *Ranges objects

The main rationale for *Ranges objects is to allow us to easily select and subset data based on genomic position information. This is really powerful!

*Ranges オブジェクトの存在の根本的理由はゲノムの位置情報に基づいて データを簡単に選択およびサブセット化できるようにすることです。 これは本当に強力です!

GRanges and GRangesLists act like data.frames and lists, and can be subsetted using the [ function. As a really artificial example:

GRangesGRangesLists は data.frames や lists のように機能し、 [ 関数を使用してサブセット化できます。本当に人為的な例として:

txs[txs %over% gns[1:2,]]
## GRangesList object of length 3:
## $`1`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58858172-58864865      - |     70455  uc002qsd.4
##   [2]    chr19 58859832-58874214      - |     70456  uc002qsf.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`10`
## GRanges object with 1 range and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr8 18248755-18258723      + |     31944  uc003wyw.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`162968`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr19 58865723-58874214      - |     70457  uc002qsh.2
##   [2]    chr19 58865723-58874214      - |     70458  uc002qsi.2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

*Ranges use cases

SummarizedExperiment objects

SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations:

SummarizedExperiment オブジェクトは ExpressionSets のようなものですが 、行ごとのアノテーションは GRanges なので、ゲノムの位置でサブセット化できます。

SummarizedExperiment objects are popular objects for representing expression data and other rectangular data (feature x sample data). Incoming packages are now strongly recommended to use this class representation instead of ExpressionSet.

SummarizedExperiment オブジェクトは、発現データやその他の長方形データ (機能 x サンプルデータ) を表すための一般的なオブジェクトです。 これからのパッケージには ExpressionSet の代わりにこのクラス表現を使うことを強くお勧めします。

TxDb exercises

OrganismDb packages

OrganismDb packages are meta-packages that contain an OrgDb, a TxDb, and a GO.db package and allow cross-queries between those packages.

OrganismDbパッケージは、OrgDb、TxDb、およびGO.dbパッケージを含み、 それらのパッケージ間のクロスクエリを可能にするメタパッケージです。

All previous accessors work; select, mapIds, transcripts, etc.

前述のすべてのアクセサーが機能します。selectmapIdstranscriptsなど。

library(Homo.sapiens)
Homo.sapiens
## OrganismDb Object:
## # Includes GODb Object:  GO.db 
## # With data about:  Gene Ontology 
## # Includes OrgDb Object:  org.Hs.eg.db 
## # Gene data about:  Homo sapiens 
## # Taxonomy Id:  9606 
## # Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg19.knownGene 
## # Transcriptome data about:  Homo sapiens 
## # Based on genome:  hg19 
## # The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

OrganismDb packages

head(genes(Homo.sapiens, columns = c("ENTREZID","ALIAS","UNIPROT")),4)
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:many mapping between keys and columns
## GRanges object with 4 ranges and 3 metadata columns:
##        seqnames            ranges strand |                 ALIAS
##           <Rle>         <IRanges>  <Rle> |       <CharacterList>
##      1    chr19 58858172-58874214      - |       A1B,ABG,GAB,...
##     10     chr8 18248755-18258723      + |   AAC2,NAT-2,PNAT,...
##    100    chr20 43248163-43280376      - |                   ADA
##   1000    chr18 25530930-25757445      - | CD325,CDHN,CDw325,...
##                         UNIPROT        ENTREZID
##                 <CharacterList> <CharacterList>
##      1            P04217,V9HWD8               1
##     10            A4Z6T7,P11245              10
##    100 A0A0S2Z381,P00813,F5GWI4             100
##   1000        P19022,A0A024RC42            1000
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

OrganismDb exercises

Organism.dplyr package

library(Organism.dplyr)

# src = src_organism("TxDb.Hsapiens.UCSC.hg19.knownGene")
src <- src_organism(dbpath = hg38light())
src
## src:  sqlite 3.30.1 [C:\Users\kozon\Documents\R\win-library\4.0\Organism.dplyr\extdata\light.hg38.knownGene.sqlite]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

Organism.dplyr

Get promoters from a TxDb object (we use a small version)

TxDb オブジェクトからプロモーターを取得しましょう (小さいバージョンのものを使用します(訳注: ここは意図掴めず))

options(width = 120)
promoters(src)
## GRanges object with 96 ranges and 2 metadata columns:
##                                  seqnames              ranges strand |     tx_id            tx_name
##                                     <Rle>           <IRanges>  <Rle> | <integer>        <character>
##    ENST00000336199.9                 chr1 243843037-243845236      - |     19678  ENST00000336199.9
##    ENST00000366540.5                 chr1 243843385-243845584      - |     19679  ENST00000366540.5
##   ENST00000263826.11                 chr1 243843083-243845282      - |     19680 ENST00000263826.11
##    ENST00000673466.1                 chr1 243850044-243852243      - |     19681  ENST00000673466.1
##    ENST00000672460.1                 chr1 243664613-243666812      - |     19682  ENST00000672460.1
##                  ...                  ...                 ...    ... .       ...                ...
##    ENST00000646420.1 chr11_KZ559110v1_alt         38835-41034      + |    240550  ENST00000646420.1
##    ENST00000642929.1 chr11_KZ559110v1_alt         38852-41051      + |    240551  ENST00000642929.1
##    ENST00000644696.1 chr11_KZ559110v1_alt         38861-41060      + |    240552  ENST00000644696.1
##    ENST00000646644.1 chr11_KZ559110v1_alt         39187-41386      + |    240553  ENST00000646644.1
##    ENST00000645918.1 chr11_KZ559110v1_alt         66615-68814      + |    240554  ENST00000645918.1
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

Organism.dplyr

Extract a table from the underlying database

基になるデータベースからテーブルを抽出する

tbl(src, "id")
## # Source:   table<id> [?? x 6]
## # Database: sqlite 3.30.1 []
##    entrez map      ensembl         symbol genename               alias   
##    <chr>  <chr>    <chr>           <chr>  <chr>                  <chr>   
##  1 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1B     
##  2 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein ABG     
##  3 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein GAB     
##  4 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein HYST2477
##  5 1      19q13.43 ENSG00000121410 A1BG   alpha-1-B glycoprotein A1BG    
##  6 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  AAC2    
##  7 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT-2   
##  8 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  PNAT    
##  9 10     8p22     ENSG00000156006 NAT2   N-acetyltransferase 2  NAT2    
## 10 100    20q13.12 ENSG00000196839 ADA    adenosine deaminase    ADA     
## # ... with more rows

Organism.dplyr

Make a complex query between tables in the underlying database

基になるデータベースのテーブル間で複雑なクエリを作成する

inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
            filter(symbol %in% c("ADA", "NAT2")) %>%
            dplyr::select(gene_chrom, gene_start, gene_end,
            gene_strand, symbol, alias, map)
## Joining, by = "entrez"
## # Source:   lazy query [?? x 7]
## # Database: sqlite 3.30.1 []
##   gene_chrom gene_start gene_end gene_strand symbol alias map     
##   <chr>           <int>    <int> <chr>       <chr>  <chr> <chr>   
## 1 chr8         18391282 18401218 +           NAT2   AAC2  8p22    
## 2 chr8         18391282 18401218 +           NAT2   NAT-2 8p22    
## 3 chr8         18391282 18401218 +           NAT2   PNAT  8p22    
## 4 chr8         18391282 18401218 +           NAT2   NAT2  8p22    
## 5 chr20        44619522 44652233 -           ADA    ADA   20q13.12

Organism.dplyr exercises

BSgenome packages

BSgenome packages contain sequence information for a given species/build. There are many such packages - you can get a listing using available.genomes

BSgenomeパッケージには、特定の種/ビルドのシーケンス情報が含まれています。 そのようなパッケージはたくさんあります - available.genomes を使用して そのリストを取得できます。

library(BSgenome)
head(available.genomes())
## [1] "BSgenome.Alyrata.JGI.v1"                 "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.UCSC.apiMel2"        "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Aofficinalis.NCBI.V1"           "BSgenome.Athaliana.TAIR.04232008"

BSgenome packages

We can load and inspect a BSgenome package

BSgenomeパッケージを読み込んで調査できます

library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
## Human genome:
## # organism: Homo sapiens (Human)
## # genome: hg19
## # provider: UCSC
## # release date: June 2013
## # 298 sequences:
## #   chr1                  chr2                  chr3                  chr4                  chr5                 
## #   chr6                  chr7                  chr8                  chr9                  chr10                
## #   chr11                 chr12                 chr13                 chr14                 chr15                
## #   chr16                 chr17                 chr18                 chr19                 chr20                
## #   chr21                 chr22                 chrX                  chrY                  chrM                 
## #   ...                   ...                   ...                   ...                   ...                  
## #   chr18_gl383570_alt    chr18_gl383571_alt    chr18_gl383572_alt    chr19_gl383573_alt    chr19_gl383574_alt   
## #   chr19_gl383575_alt    chr19_gl383576_alt    chr19_gl949746_alt    chr19_gl949747_alt    chr19_gl949748_alt   
## #   chr19_gl949749_alt    chr19_gl949750_alt    chr19_gl949751_alt    chr19_gl949752_alt    chr19_gl949753_alt   
## #   chr20_gl383577_alt    chr21_gl383578_alt    chr21_gl383579_alt    chr21_gl383580_alt    chr21_gl383581_alt   
## #   chr22_gl383582_alt    chr22_gl383583_alt    chr22_kb663609_alt                                               
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)

BSgenome packages

The main accessor is getSeq, and you can get data by sequence (e.g., entire chromosome or unplaced scaffold), or by passing in a GRanges object, to get just a region.

メインのアクセサーはgetSeqであり、 シーケンス (たとえば、染色体全体またはそこに配置されていないスキャフォールド) ごとに、またはGRangesオブジェクトを渡すことによって、データを取得することで ある領域のみを取得できます。

getSeq(Hsapiens, "chr1")
## 249250621-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
getSeq(Hsapiens, gns["5467",])
## DNAStringSet object of length 1:
##     width seq                                                                                       names               
## [1] 85634 GCGGAGCGTGTGACGCTGCGGCCGCCGCGGACCTGGGGATTAA...ACTTTAAATAAATGGGAATTAAATATTTAAGAGCTGACTGGAA 5467

The Biostrings package contains most of the code for dealing with these *StringSet objects - please see the Biostrings vignettes and help pages for more information.

Biostringsパッケージには、これらの *StringSet オブジェクトを処理するための ほとんどのコードが含まれています。 詳細については、Biostringsのビネットとヘルプページを参照してください。

BSgenome exercises

AnnotationHub

AnnotationHub is a package that allows us to query and download many different annotation objects, without having to explicitly install them.

AnnotationHubは、明示的にインストールしなくても、 さまざまなアノテーションオブジェクトをクエリしてダウンロードできるように してくれるパッケージです。

library(AnnotationHub)
hub <- AnnotationHub()
## snapshotDate(): 2020-09-03
hub
## AnnotationHub with 50528 records
## # snapshotDate(): 2020-09-03
## # $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Haemcode, FungiDB, Inparanoid...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Pan troglodytes, Rattus norvegicus, Dan...
## # $rdataclass: GRanges, BigWigFile, TwoBitFile, Rle, EnsDb, OrgDb, ChainFile, TxDb, Inparanoid8Db, data.frame
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded,
## #   preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##             title                                           
##   AH5012  | Chromosome Band                                 
##   AH5013  | STS Markers                                     
##   AH5014  | FISH Clones                                     
##   AH5015  | Recomb Rate                                     
##   AH5016  | ENCODE Pilot                                    
##   ...       ...                                             
##   AH83360 | Ensembl 101 EnsDb for Zonotrichia albicollis    
##   AH83361 | Ensembl 101 EnsDb for Zalophus californianus    
##   AH83362 | Sequences of snoRNA targets of Homo sapiens hg38
##   AH83363 | EpiTxDb snoRNAdb for Mus musculus mm10          
##   AH83364 | Sequences of snoRNA targets of Mus musculus mm10

Querying AnnotationHub

Finding the ‘right’ resource on AnnotationHub is like using Google - a well posed query is necessary to find what you are after. Useful queries are based on

AnnotationHubで ‘適切な’ リソースを見つけることは、Googleを使うようなものです。 目的物を見つけるには、適切なクエリが必要です。有用なクエリは

names(mcols(hub))
##  [1] "title"              "dataprovider"       "species"            "taxonomyid"         "genome"            
##  [6] "description"        "coordinate_1_based" "maintainer"         "rdatadateadded"     "preparerclass"     
## [11] "tags"               "rdataclass"         "rdatapath"          "sourceurl"          "sourcetype"

AnnotationHub Data providers

unique(hub$dataprovider)
##  [1] "UCSC"                                                "Ensembl"                                            
##  [3] "RefNet"                                              "Inparanoid8"                                        
##  [5] "NHLBI"                                               "ChEA"                                               
##  [7] "Pazar"                                               "NIH Pathway Interaction Database"                   
##  [9] "Haemcode"                                            "BroadInstitute"                                     
## [11] "PRIDE"                                               "Gencode"                                            
## [13] "CRIBI"                                               "Genoscope"                                          
## [15] "MISO, VAST-TOOLS, UCSC"                              "UWashington"                                        
## [17] "Stanford"                                            "dbSNP"                                              
## [19] "BioMart"                                             "GeneOntology"                                       
## [21] "KEGG"                                                "URGI"                                               
## [23] "EMBL-EBI"                                            "MicrosporidiaDB"                                    
## [25] "FungiDB"                                             "TriTrypDB"                                          
## [27] "ToxoDB"                                              "AmoebaDB"                                           
## [29] "PlasmoDB"                                            "PiroplasmaDB"                                       
## [31] "CryptoDB"                                            "TrichDB"                                            
## [33] "GiardiaDB"                                           "The Gene Ontology Consortium"                       
## [35] "ENCODE Project"                                      "SchistoDB"                                          
## [37] "NCBI/UniProt"                                        "GENCODE"                                            
## [39] "http://www.pantherdb.org"                            "RMBase v2.0"                                        
## [41] "snoRNAdb"                                            "tRNAdb"                                             
## [43] "NCBI"                                                "DrugAge, DrugBank, Broad Institute, Broad Institute"
## [45] "DrugAge"                                             "DrugBank"                                           
## [47] "Broad Institute"                                     "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"              
## [49] "HMDB, EMBL-EBI, EPA"                                 "STRING"                                             
## [51] "OMA"                                                 "OrthoDB"

AnnotationHub Data classes

unique(hub$rdataclass)
##  [1] "GRanges"                           "data.frame"                        "Inparanoid8Db"                    
##  [4] "TwoBitFile"                        "ChainFile"                         "SQLiteConnection"                 
##  [7] "biopax"                            "BigWigFile"                        "AAStringSet"                      
## [10] "MSnSet"                            "mzRpwiz"                           "mzRident"                         
## [13] "list"                              "TxDb"                              "Rle"                              
## [16] "EnsDb"                             "VcfFile"                           "igraph"                           
## [19] "data.frame, DNAStringSet, GRanges" "sqlite"                            "data.table"                       
## [22] "character"                         "SQLite"                            "SQLiteFile"                       
## [25] "OrgDb"                             "Tibble"                            "Rda"                              
## [28] "FaFile"

AnnotationHub Species

head(unique(hub$species))
## [1] "Homo sapiens"         "Vicugna pacos"        "Dasypus novemcinctus" "Otolemur garnettii"   "Papio hamadryas"     
## [6] "Papio anubis"
length(unique(hub$species))
## [1] 2126

AnnotationHub Data sources

unique(hub$sourcetype)
##  [1] "UCSC track"   "GTF"          "TSV"          "Inparanoid"   "TwoBit"       "Chain"        "GRASP"       
##  [8] "Zip"          "CSV"          "BioPax"       "BioPaxLevel2" "RData"        "BED"          "BigWig"      
## [15] "tab"          "FASTA"        "mzTab"        "mzML"         "mzid"         "GFF"          "ensembl"     
## [22] "VCF"          "TXT"          "JSON"         "XML"          "XLS/XLSX"     "NCBI/ensembl" "NCBI/UniProt"

AnnotationHub query

qry <- query(hub, c("granges","homo sapiens","ensembl"))
qry
## AnnotationHub with 84 records
## # snapshotDate(): 2020-09-03
## # $dataprovider: Ensembl, UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded,
## #   preparerclass, tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5046"]]' 
## 
##             title                                           
##   AH5046  | Ensembl Genes                                   
##   AH5160  | Ensembl Genes                                   
##   AH5311  | Ensembl Genes                                   
##   AH5434  | Ensembl Genes                                   
##   AH5435  | Ensembl EST Genes                               
##   ...       ...                                             
##   AH79161 | Homo_sapiens.GRCh38.99.gtf                      
##   AH80074 | Homo_sapiens.GRCh38.100.abinitio.gtf            
##   AH80075 | Homo_sapiens.GRCh38.100.chr.gtf                 
##   AH80076 | Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf
##   AH80077 | Homo_sapiens.GRCh38.100.gtf

AnnotationHub query

qry$sourceurl
##  [1] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/ensGene"                                    
##  [2] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg18/database/ensGene"                                    
##  [3] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg17/database/ensGene"                                    
##  [4] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg16/database/ensGene"                                    
##  [5] "rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg16/database/ensEstGene"                                 
##  [6] "ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz"                       
##  [7] "ftp://ftp.ensembl.org/pub/release-69/gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz"                       
##  [8] "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"                       
##  [9] "ftp://ftp.ensembl.org/pub/release-72/gtf/homo_sapiens/Homo_sapiens.GRCh37.72.gtf.gz"                       
## [10] "ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz"                       
## [11] "ftp://ftp.ensembl.org/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz"                       
## [12] "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz"                       
## [13] "ftp://ftp.ensembl.org/pub/release-78/gtf/homo_sapiens/Homo_sapiens.GRCh38.78.gtf.gz"                       
## [14] "ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz"                       
## [15] "ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz"                       
## [16] "ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz"                       
## [17] "ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sapiens.GRCh38.80.gtf.gz"                       
## [18] "ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz"                       
## [19] "ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz"                       
## [20] "ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz"                       
## [21] "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.abinitio.gtf.gz"              
## [22] "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr.gtf.gz"                   
## [23] "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz"  
## [24] "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz"                       
## [25] "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.abinitio.gtf.gz"              
## [26] "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.chr.gtf.gz"                   
## [27] "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.chr_patch_hapl_scaff.gtf.gz"  
## [28] "ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz"                       
## [29] "ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.abinitio.gtf.gz"              
## [30] "ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.chr.gtf.gz"                   
## [31] "ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.chr_patch_hapl_scaff.gtf.gz"  
## [32] "ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz"                       
## [33] "ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.abinitio.gtf.gz"              
## [34] "ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz"                   
## [35] "ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz"  
## [36] "ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz"                       
## [37] "ftp://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.abinitio.gtf.gz"              
## [38] "ftp://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.chr.gtf.gz"                   
## [39] "ftp://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.chr_patch_hapl_scaff.gtf.gz"  
## [40] "ftp://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/Homo_sapiens.GRCh38.88.gtf.gz"                       
## [41] "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.abinitio.gtf.gz"              
## [42] "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"                   
## [43] "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr_patch_hapl_scaff.gtf.gz"  
## [44] "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz"                       
## [45] "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.abinitio.gtf.gz"              
## [46] "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.chr.gtf.gz"                   
## [47] "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf.gz"  
## [48] "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz"                       
## [49] "ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.abinitio.gtf.gz"              
## [50] "ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr.gtf.gz"                   
## [51] "ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf.gz"  
## [52] "ftp://ftp.ensembl.org/pub/release-91/gtf/homo_sapiens/Homo_sapiens.GRCh38.91.gtf.gz"                       
## [53] "ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.abinitio.gtf.gz"              
## [54] "ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.chr.gtf.gz"                   
## [55] "ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.chr_patch_hapl_scaff.gtf.gz"  
## [56] "ftp://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/Homo_sapiens.GRCh38.92.gtf.gz"                       
## [57] "ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.abinitio.gtf.gz"              
## [58] "ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.chr.gtf.gz"                   
## [59] "ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.chr_patch_hapl_scaff.gtf.gz"  
## [60] "ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.gtf.gz"                       
## [61] "ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.abinitio.gtf.gz"              
## [62] "ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.chr.gtf.gz"                   
## [63] "ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.chr_patch_hapl_scaff.gtf.gz"  
## [64] "ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.gtf.gz"                       
## [65] "ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.abinitio.gtf.gz"              
## [66] "ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.chr.gtf.gz"                   
## [67] "ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.chr_patch_hapl_scaff.gtf.gz"  
## [68] "ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz"                       
## [69] "ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.abinitio.gtf.gz"              
## [70] "ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.chr.gtf.gz"                   
## [71] "ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.chr_patch_hapl_scaff.gtf.gz"  
## [72] "ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz"                       
## [73] "ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.abinitio.gtf.gz"              
## [74] "ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.chr.gtf.gz"                   
## [75] "ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.chr_patch_hapl_scaff.gtf.gz"  
## [76] "ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz"                       
## [77] "ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.abinitio.gtf.gz"              
## [78] "ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz"                   
## [79] "ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr_patch_hapl_scaff.gtf.gz"  
## [80] "ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz"                       
## [81] "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.abinitio.gtf.gz"            
## [82] "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.chr.gtf.gz"                 
## [83] "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf.gz"
## [84] "ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz"

Selecting AnnotationHub resource

whatIwant <- qry[["AH80077"]]

We can use these data as they are, or convert to a TxDb format:

これらのデータをそのまま使用することも、TxDbフォーマットに変換することもできます:

GRCh38TxDb <- makeTxDbFromGRanges(whatIwant)
GRCh38TxDb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: GRCh38.p13
## # Nb of transcripts: 227887
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2020-09-09 17:15:00 +0900 (Wed, 09 Sep 2020)
## # GenomicFeatures version at creation time: 1.41.3
## # RSQLite version at creation time: 2.2.0
## # DBSCHEMAVERSION: 1.2

AnnotationHub exercises

biomaRt

The biomaRt package allows queries to an Ensembl Biomart server. We can see the choices of servers that we can use:

biomaRtパッケージは、Ensembl Biomartサーバーへのクエリを可能にします。 下記で使用できるサーバーの選択肢を確認できます:

library(biomaRt)
listMarts(host = "useast.ensembl.org")

               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 100
2   ENSEMBL_MART_MOUSE      Mouse strains 100
3     ENSEMBL_MART_SNP  Ensembl Variation 100
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 100

biomaRt data sets

And we can then check for the available data sets on a particular server.

そして、特定のサーバーで利用可能なデータセットを確認できます。

mart <- useEnsembl("ensembl")
head(listDatasets(mart))

                     dataset                                  description
1   acalliptera_gene_ensembl             Eastern happy genes (fAstCal1.2)
2 acarolinensis_gene_ensembl               Anole lizard genes (AnoCar2.0)
3  acchrysaetos_gene_ensembl              Golden eagle genes (bAquChr1.2)
4  acitrinellus_gene_ensembl               Midas cichlid genes (Midas_v5)
5  amelanoleuca_gene_ensembl                        Panda genes (ailMel1)
6    amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0)
                 version
1             fAstCal1.2
2              AnoCar2.0
3             bAquChr1.2
4               Midas_v5
5                ailMel1
6 Astyanax_mexicanus-2.0
> 

biomaRt queries

After setting up a mart object pointing to the server and data set that we care about, we can make queries. We first set up the mart object.

対象のサーバーとデータセットを指す mart オブジェクトを設定したら、 クエリを実行できます。最初に mart オブジェクトを設定します。

mart <- useEnsembl("ensembl","hsapiens_gene_ensembl"                   )

Queries are of the form

getBM(attributes, filters, values, mart)

where

biomaRt attributes and filters

Both attributes and filters have rather inscrutable names, but a listing can be accessed using

attributes と filters はどちらもかなりわかりにくい名前を持っていますが、 リストには以下を使用してアクセスできます

atrib <- listAttributes(mart)
filts <- listFilters(mart)
head(atrib)
                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page
head(filts)
                name                            description
1    chromosome_name               Chromosome/scaffold name
2              start                                  Start
3                end                                    End
4             strand                                 Strand
5 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1
6          with_ccds                        With CCDS ID(s)

biomaRt query

A simple example query

簡単なクエリ例

afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at")
getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)

  affy_hg_u95av2 hgnc_symbol
1        1000_at       MAPK3
2      1007_s_at        DDR1
3        1001_at        TIE1
4      1002_f_at            
5      1002_f_at     CYP2C19

biomaRt exercises