1 インストール


1.1 Trinityインストール

  • Trinity RNA-Seq リードのアセンブラ

1.3 BUSCO インストール

  • BUSCO生物種間で普遍的な単一コピーオーソログを用いて、ゲノムアセンブリ、遺伝子セット、およびトランスクリプトームの完全性の評価
  • 依存プログラム(blast+, augustus, hmmer)をインストールしておく
  • augustus 遺伝子を予測
  • biocondaでインストールする
  • configディレクトリにあるconfig.ini.defaultを編集してconfig.iniを作る. 依存プログラムのpath等を書き換える. configディレクトリBUSCO_CONFIG_FILE環境変数

1.4 transrateインストール

  • translate トランスクリプトームアセンブリの品質解析。リファレンスを使用せずにアセンブリの品質を測定(ペアエンドデータのみ).
  • 依存関係をインストール. rvm, ruby, blast+, snap-aligner, bam-read, salmon
    • rvm(Ruby Version Manager)を使ってrubyをインストールする
    • snap-aligner, bam-read, salmonは./transrate-1.0.3-linux-x86_64/binの中にある
    • transrate --install-deps refで依存関係をインストールできる.
  • ubuntuで実行する場合、salmon/lib/librt.so.1及びtransrate/bin/librt.so.1を削除しないとtransrate --assembly contigs.fa --left left.fq --right right.fqが実行できない.

1.5 cd-hitインストール

  • cd-hit
  • psi-cd-hitを使うためにはlegacy BLASTが必要

1.6 TransDecoderインストール

  • TransDecoder ORFの予測および、blastやhmmscanの結果を基にしたコード領域を予測.

2 バージョン確認

## # Trinity --version
##  Trinity version: Trinity-v2.6.5 ** NOTE: Latest version of Trinity is Trinity-v2.8.5, and can be obtained at:   https://github.com/trinityrnaseq/trinityrnaseq/releases  
## # run_BUSCO.py --version
##  BUSCO 3.0.2 
## # augustus --version
##  AUGUSTUS (3.2.2) is a gene prediction tool written by M. Stanke, O. Keller, S. König, L. Gerischer and L. Romoth. 
## # transrate --version
##  1.0.3 
## # TransDecoder.LongOrfs --version
##  TransDecoder.LongOrfs 5.3.0

3 Trinityテスト

3.2 Trinity 結果

sample_trinity/
├── Trinity.fasta
├── Trinity.fasta.gene_trans_map
├── Trinity.timing
├── both.fa
├── both.fa.ok
├── both.fa.read_count
├── chrysalis
├── inchworm.K25.L25.fa
├── inchworm.K25.L25.fa.finished
├── inchworm.kmer_count
├── insilico_read_normalization
├── jellyfish.kmers.fa
├── jellyfish.kmers.fa.histo
├── left.fa.ok
├── partitioned_reads.files.list
├── partitioned_reads.files.list.ok
├── pipeliner.30442.cmds
├── read_partitions
├── recursive_trinity.cmds
├── recursive_trinity.cmds.completed
├── recursive_trinity.cmds.ok
├── right.fa.ok
└── scaffolding_entries.sam

3 directories, 20 files

3.3 TrinityStats.pl

  • Trinity.fastaの統計値を出力

3.4 TrinityStats結果

result of TrinityStats.pl
item value
Counts of transcripts, etc. NA
Total trinity ‘genes’ 45
Total trinity transcripts 71
Percent GC 49.58
Stats based on ALL transcript contigs NA
Contig N10 8756
Contig N20 8099
Contig N30 6692
Contig N40 5569
Contig N50 3918
Median contig length 592
Average contig 2035.63
Total assembled bases 144530
Stats based on ONLY LONGEST ISOFORM per ‘GENE’ NA
Contig N10 8816
Contig N20 8158
Contig N30 7194
Contig N40 6692
Contig N50 5399
Median contig length 324
Average contig 1409.04
Total assembled bases 63407

3.5 Trinity_gene_splice_modeler.pyを用いてSuperTranscriptsを構築する.

  • SuperTranscriptsについて
  • --incl_malignアラインメント結果
  • trinity_genes.fasta, trinity_genes.gtf, trinity_genes.malignができる.

4 Bridgerテスト

4.2 Bridger結果

./sample_test/test_bridger/
├── Assemble.log
├── Bridger.fasta
├── RawGraphs
├── path_search.commands
└── transcripts

2 directories, 3 files

5 BUSCOテスト

5.1 run_BUSCO.py実行

./run_TEST/
├── blast_output
├── full_table_TEST.tsv
├── hmmer_output
├── missing_busco_list_TEST.tsv
├── short_summary_TEST.txt
└── translated_proteins

3 directories, 3 files
short_summary
BUSOCs num
Complete BUSCOs (C) 4
Complete and single-copy BUSCOs (S) 4
Complete and duplicated BUSCOs (D) 0
Fragmented BUSCOs (F) 2
Missing BUSCOs (M) 4
Total BUSCO groups searched 10

6 transrateテスト

6.1 transrate --examples

## 
## Transrate v1.0.3
## 
## EXAMPLE COMMANDS:
## 
## # check dependencies and install any that are missing
## transrate --install-deps all
## 
## # get the transrate score for the assembly and each contig
## transrate --assembly contigs.fa --left left.fq --right right.fq
## 
## # basic assembly metrics only
## transrate --assembly contigs.fa
## 
## # basic and reference-based metrics with 8 threads
## transrate --assembly contigs.fa --reference ref.fa --threads 8
## 
## # contig and read-based metrics for two assemblies with 32 threads
## transrate --assembly one.fa,two.fa --left l.fq --right r.fq --threads 32

6.2 transrate --assembly実行

./example_data/example_res/
├── assemblies.csv
└── transcripts
    ├── assembly_score_optimisation.csv
    ├── aux
    │   ├── expected_bias.gz
    │   ├── fld.gz
    │   ├── meta_info.json
    │   └── observed_bias.gz
    ├── bad.transcripts.fa
    ├── cmd_info.json
    ├── contigs.csv
    ├── good.transcripts.fa
    ├── left.fq-right.fq-read_count.txt
    ├── left.fq.right.fq.transcripts.assigned.bam
    ├── left.fq.right.fq.transcripts.bam
    ├── libParams
    ├── logs
    │   ├── salmon.log
    │   └── snap.log
    ├── single_component_bad
    │   ├── cov.transcripts.fa
    │   ├── good.transcripts.fa
    │   ├── seg.transcripts.fa
    │   └── seq.transcripts.fa
    ├── transcripts
    │   ├── Genome
    │   ├── GenomeIndex
    │   ├── GenomeIndexHash
    │   └── OverflowTable
    ├── transcripts.fa_bam_info.csv
    └── transcripts.fa_quant.sf

6 directories, 25 files
assemblies.csv
assemblies transcripts.fa
n_seqs 15
smallest 849
largest 2396
n_bases 28562
mean_len 1904.13333
n_under_200 0
n_over_1k 14
n_over_10k 0
n_with_orf 15
mean_orf_percent 46.45683
n90 1612
n70 1681
n50 2037
n30 2288
n10 2385
gc 0.53067
bases_n 0
proportion_n 0.0
fragments 10000
fragments_mapped 10000
p_fragments_mapped 1.0
good_mappings 9994
p_good_mapping 0.9994
bad_mappings 6
potential_bridges 0
bases_uncovered 3059
p_bases_uncovered 0.1071
contigs_uncovbase 13
p_contigs_uncovbase 0.86667
contigs_uncovered 2
p_contigs_uncovered 0.13333
contigs_lowcovered 10
p_contigs_lowcovered 0.66667
contigs_segmented 0
p_contigs_segmented 0.0
score 0.66397
optimal_score 0.9032
cutoff 0.76522
weighted 60550.65257

7 cd-hit-est

7.1 cd-hit-est実行

  • *_cdh.fasta, *_cdh.fasta.clstr
./denovotc/
├── Trinity.fasta
├── Trinity_cdh.fasta
└── Trinity_cdh.fasta.clstr

0 directories, 3 files

7.2 Trinity_cdh.fasta.clstr

  • 配列名とクラスタ番号が記載されている
>Cluster 0
0   9376nt, >TRINITY_DN27508_c0_g1_i2... *
1   9375nt, >TRINITY_DN27508_c0_g1_i4... at 1:9375:1:9376/+/99.98%
>Cluster 1
0   9372nt, >TRINITY_DN27508_c0_g1_i3... at 1:9372:1:9374/+/100.00%
1   9374nt, >TRINITY_DN27508_c0_g1_i5... *

8 TransDecoder

8.1 ORF予測TransDecoder.LongOrfs

  • options
    • -t target transcripts.fasta
    • --gene_trans_map 遺伝子idとトランスクリプトidの対応表 (tab-delimited)
    • -m minimum protein length (default: 100)
    • -G genetic code (default: universal; options: Euplotes, Tetrahymena, Candida, Acetabularia)
    • -S strand-specific (only analyzes top strand)
  • *.transdecoder_dirに結果が出力される。
## 
## ########################################################################################
## #             ______                 ___                  __
## #            /_  __/______ ____ ___ / _ \___ _______  ___/ /__ ____
## #             / / / __/ _ `/ _\(_-</ // / -_) __/ _ \/ _  / -_) __/
## #            /_/ /_/ \_,_/_//_/___/____/\__/\__/\___/\_,_/\__/_/   .LongOrfs
## #                                                       
## ########################################################################################
## #
## #  Transdecoder.LongOrfs|http://transdecoder.github.io> - Transcriptome Protein Prediction
## #
## #
## #  Required:
## #
## #    -t <string>                            transcripts.fasta
## #
## #  Optional:
## #
## #   --gene_trans_map <string>              gene-to-transcript identifier mapping file (tab-delimited, gene_id<tab>trans_id<return> ) 
## #
## #   -m <int>                               minimum protein length (default: 100)
## # 
## #   -G <string>                            genetic code (default: universal; see PerlDoc; options: Euplotes, Tetrahymena, Candida, Acetabularia)
## #  
## #   -S                                     strand-specific (only analyzes top strand)
## #
## #   --genetic_code <string>                Universal (default)
## #
## #        Genetic Codes (derived from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
## #
## #
## Acetabularia
## Candida
## Ciliate
## Dasycladacean
## Euplotid
## Hexamita
## Mesodinium
## Mitochondrial-Ascidian
## Mitochondrial-Chlorophycean
## Mitochondrial-Echinoderm
## Mitochondrial-Flatworm
## Mitochondrial-Invertebrates
## Mitochondrial-Protozoan
## Mitochondrial-Pterobranchia
## Mitochondrial-Scenedesmus_obliquus
## Mitochondrial-Thraustochytrium
## Mitochondrial-Trematode
## Mitochondrial-Vertebrates
## Mitochondrial-Yeast
## Pachysolen_tannophilus
## Peritrich
## SR1_Gracilibacteria
## Tetrahymena
## Universal
## #
## #
## #   --version                              show version tag (5.3.0)
## #
## #########################################################################################
./Trinity_cdh.fasta.transdecoder_dir
├── base_freqs.dat
├── longest_orfs.cds
├── longest_orfs.gff3
└── longest_orfs.pep

0 directories, 4 files

8.3 blast及びhmmscanの結果を基に可能性のあるコード領域を予測するTransDecoder.Predict

  • --retain_pfam_hits--retain_blastp_hitsでファイル指定
  • Trandecoder.LongOrfsを実行したディレクトリでTransDecoder.Predictを実行する。TransDecoder.LongOrfsの出力ディレクトリ(コンティグ名.transdecoder_dir)がないと実行されない。
  • 必須ではないが RパッケージseqLogoがあるとモチーフのイメージを作成する。

8.3.2 hmmscan

  • hmmscanの結果はそのままではうまく読み込めない
result of hmmscan
target name accession query name accession_ E-value(full) score(full) bias(full) E-value(best 1 domain) score(best 1 domain) bias(best 1 domain) exp reg clu ov env dom rep inc description
DUF3450 PF11932.8 TRINITY_DN25_c0_g1_i1.p1 - 0.0047 16.3 12.0 0.0051 16.1 10.9 1.5 1 1 0 1 1 1 1 Protein of unknown function (DUF3450)
DUF1891 PF09004.10 TRINITY_DN25_c0_g1_i1.p1 - 0.02 14.5 1.0 0.062 12.9 0.1 2.2 2 0 0 2 2 2 0 Domain of unknown function (DUF1891)
DUF773 PF05600.12 TRINITY_DN25_c0_g1_i1.p1 - 0.076 11.7 9.2 0.086 11.6 9.2 1.1 1 0 0 1 1 1 0 CDK5 regulatory subunit-associated protein 3
PCRF PF03462.18 TRINITY_DN25_c0_g1_i1.p1 - 0.3 10.9 6.4 0.39 10.6 6.4 1.3 1 0 0 1 1 1 0 PCRF domain
PEP-utilisers_N PF05524.13 TRINITY_DN25_c0_g1_i1.p1 - 0.58 10.3 4.8 0.91 9.7 4.8 1.4 1 0 0 1 1 1 0 PEP-utilising enzyme, N-terminal
TMPIT PF07851.13 TRINITY_DN25_c0_g1_i1.p1 - 0.96 8.6 10.8 1.1 8.4 10.8 1.1 1 0 0 1 1 1 0 TMPIT-like protein

9 環境

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=ja_JP.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ja_JP.UTF-8        LC_COLLATE=ja_JP.UTF-8    
##  [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=ja_JP.UTF-8   
##  [7] LC_PAPER=ja_JP.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.2.0    kableExtra_1.1.0 dplyr_0.8.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1          XVector_0.24.0      pillar_1.4.1       
##  [4] compiler_3.6.0      highr_0.8           zlibbioc_1.30.0    
##  [7] tools_3.6.0         digest_0.6.19       gtable_0.3.0       
## [10] evaluate_0.14       tibble_2.1.3        viridisLite_0.3.0  
## [13] pkgconfig_2.0.2     rlang_0.4.0         rstudioapi_0.10    
## [16] yaml_2.2.0          parallel_3.6.0      xfun_0.8           
## [19] withr_2.1.2         stringr_1.4.0       httr_1.4.0         
## [22] knitr_1.23          xml2_1.2.0          IRanges_2.18.1     
## [25] Biostrings_2.52.0   S4Vectors_0.22.0    hms_0.4.2          
## [28] grid_3.6.0          stats4_3.6.0        webshot_0.5.1      
## [31] tidyselect_0.2.5    glue_1.3.1          R6_2.4.0           
## [34] rmarkdown_1.13      readr_1.3.1         purrr_0.3.2        
## [37] magrittr_1.5        scales_1.0.0        htmltools_0.3.6    
## [40] BiocGenerics_0.30.0 assertthat_0.2.1    rvest_0.3.4        
## [43] colorspace_1.4-1    labeling_0.3        stringi_1.4.3      
## [46] lazyeval_0.2.2      munsell_0.5.0       crayon_1.3.4