`read.bismark`: part of the `bsseq` Bioconductor package

Peter Hickey (peter.hickey@gmail.com) (20/11/2012)

I use Bismark v0.7.7 and the test dataset downloadable from http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq. I follow the Bismark User Guide to align the reads and extract methylation information. I use v0.7.3 of the bsseq Bioconductor package.

Align the data to the reference genome and extract methylation calls

The test_data.fastq reads are single-end.

# Set the path of the Bismark genome_folder on your system
GENOME_FOLDER=${DB}/WGBS/genomes/hg19+lambda_phage/
# Download the reads for the test data set
wget http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq
# Align the data with Bismark
bismark -n 1 -l 50 ${GENOME_FOLDER} test_data.fastq
# Extract the methylation calls and summarise as bedGraphs
methylation_extractor -s --comprehensive test_data.fastq_bismark.sam
genome_methylation_bismark2bedGraph_v4.pl --counts CpG_context_test_data.fastq_bismark.txt > CpG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHG_context_test_data.fastq_bismark.txt > CHG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHH_context_test_data.fastq_bismark.txt > CHH_context_test_data.fastq_bismark.bedGraph

A few notes on each step

methylation_extractor

--comprehensive is required
--merge_non_CpG is optional

genome_methylation_bismark2bedGraph_v4.pl

--counts is required
If you are only interested in CpG-methylation then only process the CpG-specific files

NB: methylation_extractor can actually produce the bedGraph in a single-step, i.e. without requiring an explicit call to genome_methylation_bismark2bedGraph_v4.pl via bismark_methylation_extractor -s --comprehensive --bedGraph --counts test_data.fastq_bismark.sam. However, the current version of methylation_extractor has problems with chromosome names containing non-standard characters such as |, which is commonly used in FASTA contigs. Until this bug is fixed I recommend doing the two-step process.

Input to `read.bismark`

Input files are the output of genome_methylation_bismark2bedGraph_v4.pl --counts, e.g. CpG_context_test_data.fastq_bismark.bedGraph in our test data set. The tab-separated file has the following 6 columns:

chrom start end mPerc mCount uCount

NB: mPerc = mCount/(mCount + uCount) * 100 and start = end.

Here are the first few lines of CpG_context_test_data.fastq_bismark.bedGraph.

head CpG_context_test_data.fastq_bismark.bedGraph

chr1    910855  910855  100 1   0
chr1    910867  910867  100 1   0
chr1    910869  910869  100 1   0
chr1    2155925 2155925 100 1   0
chr1    2155927 2155927 100 1   0
chr1    2155952 2155952 100 1   0
chr1    2155958 2155958 100 1   0
chr1    2244658 2244658 0   0   1
chr1    9373960 9373960 100 1   0
chr1    9373972 9373972 100 1   0

Gzipped input files are accepted.

By default, genomic co-ordinates in Bismark are 1-based. bsseq also follows the 1-based convention, however 0-based are acceptable. Of course, it is not a good idea to mix and match these two different co-ordinate systems in the one analysis.

Output of `read.bismark`

Based on read.bssmooth(); either an object of class BSseq (if returnRaw = FALSE) or a list of GRanges which each component coming from a file (if returnRaw = TRUE).

Warnings

WARNING: Bismark's methylation extraction tools do not provide any option to filter by base quality (unlike the BSmooth pipeline). This is unfortunate but unavoidable and will not cause problems downstream. It might be possible to add this functionality to methylation_extractor.
WARNING: read.bismark() returns chromosome names exactly as they are in the Bismark output file, e.g. they could be of the form 'chr1, chr2, etc.' or they could be of the form '1, 2, etc.' or some other form.
WARNING: Strand information is lost when using genome_methylation_bismark2bedGraph.pl and is therefore not available in the object returned by read.bismark(). This is not a problem as bsseq does not make use of strand information.
WARNING: Bismark output is 1-based by default. Whilst bsseq is “index-agnostic”, it uses 1-based indexing by default. Therefore we strongly recommend that your Bismark output is 1-based.
WARNING: The Bismark output is lexicographically sorted by chromsome name, then by start co-ordinate, i.e. chromosome ordering is chr1, chr10, chr11, …, chr19, chr2, etc. and therefore the object returned by read.bismark() is “sorted” in this order as well. This not a problem but just an invonvience when viewing the BSseq object.

Example

This example uses the file CpG_context_test_data.fastq_bismark.bedGraph, which we created using Bismark at the beginning of this vignette. This file is also included as part of the bsseq package.

library(bsseq)
bismarkBedGraph <- system.file("data/CpG_context_test_data.fastq_bismark.bedGraph", 
    package = "bsseq")
bismarkBSseq <- read.bismark(files = bismarkBedGraph, sampleNames = "test_data", 
    returnRaw = FALSE, rmZeroCov = FALSE, verbose = TRUE)

## Reading file '/export/share/eva1users/allstaff/h/hickey/R/library/bsseq/data/CpG_context_test_data.fastq_bismark.bedGraph' ... in 0.2 secs
## Joining samples ... in 0.0 secs

bismarkBSseq

## An object of type 'BSseq' with
##   2013 methylation loci
##   1 samples
## has not been smoothed


sessionInfo()

## R version 2.15.2 (2012-10-26)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=C                 LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] bsseq_0.7.3          matrixStats_0.5.3    GenomicRanges_1.10.3
## [4] IRanges_1.16.4       BiocGenerics_0.4.0   knitr_0.8           
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.18.0     colorspace_1.2-0   dichromat_1.2-4   
##  [4] digest_0.5.2       evaluate_0.4.2     formatR_0.6       
##  [7] grid_2.15.2        labeling_0.1       lattice_0.20-10   
## [10] locfit_1.5-8       munsell_0.4        plyr_1.7.1        
## [13] RColorBrewer_1.0-5 R.methodsS3_1.4.2  scales_0.2.2      
## [16] stats4_2.15.2      stringr_0.6.1      tools_2.15.2

read.bismark: part of the bsseq Bioconductor package