read.bismark: part of the bsseq Bioconductor package

Peter Hickey (peter.hickey@gmail.com) (20/11/2012)

I use Bismark v0.7.7 and the test dataset downloadable from http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq. I follow the Bismark User Guide to align the reads and extract methylation information. I use v0.7.3 of the bsseq Bioconductor package.

Align the data to the reference genome and extract methylation calls

The test_data.fastq reads are single-end.

# Set the path of the Bismark genome_folder on your system
GENOME_FOLDER=${DB}/WGBS/genomes/hg19+lambda_phage/
# Download the reads for the test data set
wget http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq
# Align the data with Bismark
bismark -n 1 -l 50 ${GENOME_FOLDER} test_data.fastq
# Extract the methylation calls and summarise as bedGraphs
methylation_extractor -s --comprehensive test_data.fastq_bismark.sam
genome_methylation_bismark2bedGraph_v4.pl --counts CpG_context_test_data.fastq_bismark.txt > CpG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHG_context_test_data.fastq_bismark.txt > CHG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHH_context_test_data.fastq_bismark.txt > CHH_context_test_data.fastq_bismark.bedGraph

A few notes on each step

methylation_extractor

genome_methylation_bismark2bedGraph_v4.pl

NB: methylation_extractor can actually produce the bedGraph in a single-step, i.e. without requiring an explicit call to genome_methylation_bismark2bedGraph_v4.pl via bismark_methylation_extractor -s --comprehensive --bedGraph --counts test_data.fastq_bismark.sam. However, the current version of methylation_extractor has problems with chromosome names containing non-standard characters such as |, which is commonly used in FASTA contigs. Until this bug is fixed I recommend doing the two-step process.

Input to read.bismark

Input files are the output of genome_methylation_bismark2bedGraph_v4.pl --counts, e.g. CpG_context_test_data.fastq_bismark.bedGraph in our test data set. The tab-separated file has the following 6 columns:

chrom start end mPerc mCount uCount

NB: mPerc = mCount/(mCount + uCount) * 100 and start = end.

Here are the first few lines of CpG_context_test_data.fastq_bismark.bedGraph.

head CpG_context_test_data.fastq_bismark.bedGraph
chr1    910855  910855  100 1   0
chr1    910867  910867  100 1   0
chr1    910869  910869  100 1   0
chr1    2155925 2155925 100 1   0
chr1    2155927 2155927 100 1   0
chr1    2155952 2155952 100 1   0
chr1    2155958 2155958 100 1   0
chr1    2244658 2244658 0   0   1
chr1    9373960 9373960 100 1   0
chr1    9373972 9373972 100 1   0

Gzipped input files are accepted.

By default, genomic co-ordinates in Bismark are 1-based. bsseq also follows the 1-based convention, however 0-based are acceptable. Of course, it is not a good idea to mix and match these two different co-ordinate systems in the one analysis.

Output of read.bismark

Based on read.bssmooth(); either an object of class BSseq (if returnRaw = FALSE) or a list of GRanges which each component coming from a file (if returnRaw = TRUE).

Warnings

Example

This example uses the file CpG_context_test_data.fastq_bismark.bedGraph, which we created using Bismark at the beginning of this vignette. This file is also included as part of the bsseq package.

library(bsseq)
bismarkBedGraph <- system.file("data/CpG_context_test_data.fastq_bismark.bedGraph", 
    package = "bsseq")
bismarkBSseq <- read.bismark(files = bismarkBedGraph, sampleNames = "test_data", 
    returnRaw = FALSE, rmZeroCov = FALSE, verbose = TRUE)
## Reading file '/export/share/eva1users/allstaff/h/hickey/R/library/bsseq/data/CpG_context_test_data.fastq_bismark.bedGraph' ... in 0.2 secs
## Joining samples ... in 0.0 secs
bismarkBSseq
## An object of type 'BSseq' with
##   2013 methylation loci
##   1 samples
## has not been smoothed

sessionInfo()
## R version 2.15.2 (2012-10-26)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] bsseq_0.7.3          matrixStats_0.5.3    GenomicRanges_1.10.3
## [4] IRanges_1.16.4       BiocGenerics_0.4.0   knitr_0.8           
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.18.0     colorspace_1.2-0   dichromat_1.2-4   
##  [4] digest_0.5.2       evaluate_0.4.2     formatR_0.6       
##  [7] grid_2.15.2        labeling_0.1       lattice_0.20-10   
## [10] locfit_1.5-8       munsell_0.4        plyr_1.7.1        
## [13] RColorBrewer_1.0-5 R.methodsS3_1.4.2  scales_0.2.2      
## [16] stats4_2.15.2      stringr_0.6.1      tools_2.15.2