read.bismark: part of the bsseq Bioconductor packageI use Bismark v0.7.7 and the test dataset downloadable from http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq. I follow the Bismark User Guide to align the reads and extract methylation information. I use v0.7.3 of the bsseq Bioconductor package.
The test_data.fastq reads are single-end.
# Set the path of the Bismark genome_folder on your system
GENOME_FOLDER=${DB}/WGBS/genomes/hg19+lambda_phage/
# Download the reads for the test data set
wget http://www.bioinformatics.babraham.ac.uk/projects/bismark/test_data.fastq
# Align the data with Bismark
bismark -n 1 -l 50 ${GENOME_FOLDER} test_data.fastq
# Extract the methylation calls and summarise as bedGraphs
methylation_extractor -s --comprehensive test_data.fastq_bismark.sam
genome_methylation_bismark2bedGraph_v4.pl --counts CpG_context_test_data.fastq_bismark.txt > CpG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHG_context_test_data.fastq_bismark.txt > CHG_context_test_data.fastq_bismark.bedGraph
genome_methylation_bismark2bedGraph_v4.pl --counts CHH_context_test_data.fastq_bismark.txt > CHH_context_test_data.fastq_bismark.bedGraph
methylation_extractor
--comprehensive is required--merge_non_CpG is optionalgenome_methylation_bismark2bedGraph_v4.pl
--counts is requiredNB: methylation_extractor can actually produce the bedGraph in a single-step, i.e. without requiring an explicit call to genome_methylation_bismark2bedGraph_v4.pl via bismark_methylation_extractor -s --comprehensive --bedGraph --counts test_data.fastq_bismark.sam. However, the current version of methylation_extractor has problems with chromosome names containing non-standard characters such as |, which is commonly used in FASTA contigs. Until this bug is fixed I recommend doing the two-step process.
read.bismarkInput files are the output of genome_methylation_bismark2bedGraph_v4.pl --counts, e.g. CpG_context_test_data.fastq_bismark.bedGraph in our test data set. The tab-separated file has the following 6 columns:
chrom start end mPerc mCount uCount
NB: mPerc = mCount/(mCount + uCount) * 100 and start = end.
Here are the first few lines of CpG_context_test_data.fastq_bismark.bedGraph.
head CpG_context_test_data.fastq_bismark.bedGraph
chr1 910855 910855 100 1 0
chr1 910867 910867 100 1 0
chr1 910869 910869 100 1 0
chr1 2155925 2155925 100 1 0
chr1 2155927 2155927 100 1 0
chr1 2155952 2155952 100 1 0
chr1 2155958 2155958 100 1 0
chr1 2244658 2244658 0 0 1
chr1 9373960 9373960 100 1 0
chr1 9373972 9373972 100 1 0
Gzipped input files are accepted.
By default, genomic co-ordinates in Bismark are 1-based. bsseq also follows the 1-based convention, however 0-based are acceptable. Of course, it is not a good idea to mix and match these two different co-ordinate systems in the one analysis.
read.bismarkBased on read.bssmooth(); either an object of class BSseq (if returnRaw = FALSE) or a list of GRanges which each component coming from a file (if returnRaw = TRUE).
methylation_extractor.read.bismark() returns chromosome names exactly as they are in the Bismark output file, e.g. they could be of the form 'chr1, chr2, etc.' or they could be of the form '1, 2, etc.' or some other form.genome_methylation_bismark2bedGraph.pl and is therefore not available in the object returned by read.bismark(). This is not a problem as bsseq does not make use of strand information.bsseq is “index-agnostic”, it uses 1-based indexing by default. Therefore we strongly recommend that your Bismark output is 1-based.This example uses the file CpG_context_test_data.fastq_bismark.bedGraph, which we created using Bismark at the beginning of this vignette. This file is also included as part of the bsseq package.
library(bsseq)
bismarkBedGraph <- system.file("data/CpG_context_test_data.fastq_bismark.bedGraph",
package = "bsseq")
bismarkBSseq <- read.bismark(files = bismarkBedGraph, sampleNames = "test_data",
returnRaw = FALSE, rmZeroCov = FALSE, verbose = TRUE)
## Reading file '/export/share/eva1users/allstaff/h/hickey/R/library/bsseq/data/CpG_context_test_data.fastq_bismark.bedGraph' ... in 0.2 secs
## Joining samples ... in 0.0 secs
bismarkBSseq
## An object of type 'BSseq' with
## 2013 methylation loci
## 1 samples
## has not been smoothed
sessionInfo()
## R version 2.15.2 (2012-10-26)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=C LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bsseq_0.7.3 matrixStats_0.5.3 GenomicRanges_1.10.3
## [4] IRanges_1.16.4 BiocGenerics_0.4.0 knitr_0.8
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.18.0 colorspace_1.2-0 dichromat_1.2-4
## [4] digest_0.5.2 evaluate_0.4.2 formatR_0.6
## [7] grid_2.15.2 labeling_0.1 lattice_0.20-10
## [10] locfit_1.5-8 munsell_0.4 plyr_1.7.1
## [13] RColorBrewer_1.0-5 R.methodsS3_1.4.2 scales_0.2.2
## [16] stats4_2.15.2 stringr_0.6.1 tools_2.15.2