We need many packages to run the code in this notebook. Here are they
library(tidygenomics) # latest development version can be installed by devtools::install_github("const-ae/tidygenomics")
library(biobroom) # devtools::install_github("StoreyLab/biobroom")
library(tidyverse)
# source("http://www.bioconductor.org/biocLite.R")
# biocLite("DESeq2", ask=FALSE, suppressUpdates=TRUE)
# biocLite("limma", ask=FALSE, suppressUpdates=TRUE)
human_genes_22 <- readRDS(gzcon(url("https://s3-us-west-2.amazonaws.com/veri-analizi/human_genes_chr22.rds")))
human_snp_22 <- readRDS(gzcon(url("https://s3-us-west-2.amazonaws.com/veri-analizi/human_snp147_chr22.rds")))
Bioconductor packages have their own repository. Some packages are both in CRAN and in Bioconductor, but some packages are solely on Bioconductor. Thus, for these packages you’ll get the following error:
> install.packages("TCGAbiolinks")
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
Warning in install.packages :
package 'TCGAbiolinks' is not available (for R version 3.4.1)
But, same package can be installed with the following command:
# DO NOT RUN THIS! THIS IS AN EXAMPLE!
source("https://bioconductor.org/biocLite.R")
biocLite("TCGAbiolinks")
There are many Bioconductor packages available for many domains. Please go over the Search page
To have a deeper insight about some Bioconductor packages please visit the Tutorials page.
There are many, specific and complicated objects in R. GRanges, BioString, ExpressionSet, SummarizedExperiment, RangedSummarizedExperiment are few of them.
Here’s visual representation of RangedSummarizedExperiment object:
(Source)
More information about GRanges is located at its Bioconductor page. First PDF file (GenomicRanges Intro), chapter 2.3 mentions interval operations. Chapter 3 mentions GRangesList object (even more complicated) where list of ranges are grouped.
Interval operations of genomic ranges can be done in tibble formatted tables. The tables can be result of a pipe or any other dplyr/tidyverse command.
Examples are at: https://github.com/const-ae/tidygenomics
library(tidygenomics)
library(tidyverse)
x1 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr2"),
start = c(100, 200, 300, 400),
end = c(150, 250, 350, 450))
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr2", "chr1"),
start = c(140, 210, 400, 300),
end = c(160, 240, 415, 320))
x1
x2
genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
x2 <- data.frame(id = 1:4,
chromosome = c("chr1", "chr2", "chr1", "chr1"),
start = c(120, 210, 300, 400),
end = c(125, 240, 320, 415))
genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
x2 <- data_frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
genome_join_closest(x1, x2, by=c("chromosome", "start", "end"), distance_column_name="distance", mode="left")
Column `chromosome` joining factor and character vector, coercing into character vector
genome_cluster(x1, by=c("chromosome", "start", "end"))
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
genome_complement(x1, by=c("chromosome", "start", "end"))
x2 <- data_frame(id = 1:4,
chromosome = c("chr1", "chr1", "chr1", "chr2"),
start = c(220, 210, 300, 400),
end = c(225, 240, 320, 415))
fuzzyjoin::genome_join(x1, x2, by=c("chromosome", "start", "end"), mode="inner")
Column `chromosome` joining factor and character vector, coercing into character vector
fuzzyjoin::genome_join(x1, x2, by=c("chromosome", "start", "end"), mode="left")
Column `chromosome` joining factor and character vector, coercing into character vector
fuzzyjoin::genome_join(x1, x2, by=c("chromosome", "start", "end"), mode="anti")
Column `chromosome` joining factor and character vector, coercing into character vector
human_genes_22 # loaded from url in first chunk
human_snp_22 # loaded from url in first chunk
Notice that column names and shapes of two files can be different (e.g., strand can be “-” or “-1”), other softwares or packages would complain about this issue. But being free of any restrictions, we can manipulate with dplyr verbs and then use them
Let’s see which snps are within human genes
human_genes_22 %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both")
Since this is tibble, let’s count snps per gene
human_genes_22 %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both") %>%
count(name.x)
Let’s normalize this with gene length
human_genes_22 %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both") %>%
count(name.x) %>%
inner_join(human_genes_22, by=c("name.x"="name")) %>%
mutate(snp_per_kb=n / ( (end-start) /1000) ) %>%
arrange(-snp_per_kb)
What about, snps within promoter regions? Let’s generate 1000 bp upstream of genes with dplyr verbs and then intersect.
human_genes_22 %>%
mutate(end=start-1, start=start-1000) %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both")
It works! But we made a big mistake, the genes on negative strand should have their upstream regions extended towards right, not left. So, we can use ifelse function to extend promoter regions according to strand value.
human_genes_22 %>%
mutate(p_end=ifelse(strand=="+", start-1, end+1000),
p_start=ifelse(strand=="+", start-1000, end+1),
start=p_start,
end=p_end) %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both")
Also, notice that we didn’t use strand in intersections. We can use this column to see strand specific intersections. We already know that genes have strand specificity, andt the snp data contains strand information. Let’s find snps in promoter regions where promoter region strand and snp strand matches. For that, we use filter verb:
human_genes_22 %>%
mutate(p_end=ifelse(strand=="+", start-1, end+1000),
p_start=ifelse(strand=="+", start-1000, end+1),
start=p_start,
end=p_end) %>%
genome_intersect(human_snp_22, by=c("chrom", "start", "end"), mode="both") %>%
filter(strand.x==strand.y)
As you can see, you can feed genome_intersect with a custom table and you can use genome_intersect within a pipe. That’s huge benefit allowing flexibility.
The pioneer in genome arithmetic is bedtools software which can perform any possible genome interval operation. Please check the tutorial page for numerous examples.
Tidygenomics package is trying to implement some of the basic operations done by bedtools software. To do more complicated and advanced operations some R packages interface with the bedtools software installed locally. But valr package performs most of the bedtools operations natively. Please visit its main page for more info.
#PLEASE DO NOT RUN, THIS IS JUST AN EXAMPLE
# devtools::install_github("rnabioco/valrdata") #installs valr and valrdata
library(valr)
# valrdata::dnase_data %>% bind_rows(.id="id")
valrdata::repeat_data
valrdata::hg19_genes # similar to human_genes dataset we have used in past
An extreme example, from valr examples page.
# PLEASE DO NOT RUN, THIS IS AN EXAMPLE
devtools::install_github("rnabioco/valrdata")
library(valr)
res <- valrdata::dnase_data %>%
cross2(.,.) %>%
map(lift(bed_jaccard)) %>%
map('jaccard') %>%
flatten_dbl() %>%
matrix(nrow = 20, ncol = 20)
The output is 20x20 matrix having jaccard distances of 20 different data. Below is truncated portion of the result.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.0000000 0.3543413 0.2405429 0.2200565 0.2272345 0.2084734 0.2061044 0.2016117 0.2152373 0.1906840
[2,] 0.3543413 1.0000000 0.2315717 0.2098134 0.2261107 0.1949926 0.1787604 0.1836408 0.1919568 0.1750850
[3,] 0.2405429 0.2315717 1.0000000 0.5014561 0.6125025 0.2749579 0.2522233 0.2598619 0.2795607 0.2473207
[4,] 0.2200565 0.2098134 0.5014561 1.0000000 0.5063695 0.2538434 0.2346832 0.2365053 0.2621719 0.2208662
[5,] 0.2272345 0.2261107 0.6125025 0.5063695 1.0000000 0.2702726 0.2264408 0.2487867 0.2621962 0.2350299
[6,] 0.2084734 0.1949926 0.2749579 0.2538434 0.2702726 1.0000000 0.4515375 0.6073708 0.5761303 0.5547881
[7,] 0.2061044 0.1787604 0.2522233 0.2346832 0.2264408 0.4515375 1.0000000 0.4818408 0.5238530 0.4808681
[8,] 0.2016117 0.1836408 0.2598619 0.2365053 0.2487867 0.6073708 0.4818408 1.0000000 0.5986499 0.5697326
[9,] 0.2152373 0.1919568 0.2795607 0.2621719 0.2621962 0.5761303 0.5238530 0.5986499 1.0000000 0.5384424
[10,] 0.1906840 0.1750850 0.2473207 0.2208662 0.2350299 0.5547881 0.4808681 0.5697326 0.5384424 1.0000000
...truncated...
Authors of
valrpackage also put down a lecture titled “Practical Biological Data Analysis with R/RStudio” please check the lecture contents to have ideas about biological data analysis.
This package is able convert many different biological objects into tidy format. Please visit its help page for more information, some of the examples are run below.
set.seed(2014)
data <- matrix(rnorm(1000), ncol=4)
data[, 1:2] <- data[, 1:2] + .5 # add an effect
rownames(data) <- paste0("gene", 1:nrow(data))
des <- data.frame(treatment = c("a", "a", "b", "b"),confounding = rnorm(4))
lfit <- lmFit(data, model.matrix(~ treatment + confounding, des))
eb <- eBayes(lfit)
head(tidy(lfit))
head(tidy(eb))
# the tidied form puts it in an ideal form for plotting
ggplot(tidy(lfit), aes(estimate)) + geom_histogram(binwidth=1) +
facet_wrap(~ term)
ggplot(tidy(eb), aes(p.value)) + geom_histogram(binwidth=.2) +
facet_wrap(~ term)
hammer
ExpressionSet (storageMode: lockedEnvironment)
assayData: 29516 features, 8 samples
element names: exprs
protocolData: none
phenoData
sampleNames: SRX020102 SRX020103 ... SRX020098-101 (8 total)
varLabels: sample.id num.tech.reps ... Time (5 total)
varMetadata: labelDescription
featureData
featureNames: ENSRNOG00000000001 ENSRNOG00000000007 ... ENSRNOG00000045521 (29516 total)
fvarLabels: gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
Check out the tidied version of the data.
head(tidy(hammer))
head(tidy(hammer, addPheno = TRUE))
# source("https://bioconductor.org/biocLite.R")
# biocLite("airway")
library(DESeq2)
library(airway)
data(airway)
airway_se <- airway
airway_dds <- DESeqDataSet(airway_se, design = ~cell + dex)
colData(airway) %>% head()
DataFrame with 6 rows and 9 columns
SampleName cell dex albut Run avgLength Experiment Sample BioSample
<factor> <factor> <factor> <factor> <factor> <integer> <factor> <factor> <factor>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673
rowRanges(airway) %>% head()
GRangesList object of length 6:
$ENSG00000000003
GRanges object with 17 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] X [99883667, 99884983] - | 667145 ENSE00001459322
[2] X [99885756, 99885863] - | 667146 ENSE00000868868
[3] X [99887482, 99887565] - | 667147 ENSE00000401072
[4] X [99887538, 99887565] - | 667148 ENSE00001849132
[5] X [99888402, 99888536] - | 667149 ENSE00003554016
... ... ... ... . ... ...
[13] X [99890555, 99890743] - | 667156 ENSE00003512331
[14] X [99891188, 99891686] - | 667158 ENSE00001886883
[15] X [99891605, 99891803] - | 667159 ENSE00001855382
[16] X [99891790, 99892101] - | 667160 ENSE00001863395
[17] X [99894942, 99894988] - | 667161 ENSE00001828996
...
<5 more elements>
-------
seqinfo: 722 sequences (1 circular) from an unspecified genome
assay(airway) %>% head()
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 679 448 873 408 1138 1047 770 572
ENSG00000000005 0 0 0 0 0 0 0 0
ENSG00000000419 467 515 621 365 587 799 417 508
ENSG00000000457 260 211 263 164 245 331 233 229
ENSG00000000460 60 55 40 35 78 63 76 60
ENSG00000000938 0 0 2 0 1 0 0 0
tidy(airway_dds) %>% head()
# differential expression analysis
deseq <- DESeq(airway_dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
results <- results(deseq)
# tidy results
tidy(results) %>% head()
Volunteer students are wanted, who would help gathering lecture materials into book format via bookdown
Last Updated 2018-01-03