This document describes the analysis of RNA sequence data from Arabidopsis thaliana. The analysis is divided into four main sections; Mapping short reads to genome sequence, converting mapped reads to measures of gene expression, quality control and detection of differential gene expression. Thera are a number of different parameter settings that can, and should, be adjusted depending on the data and experimental design. These settings as well choice of programs to use in analysis depends on many factors and if you prefer to use other methods and seetings than the once I have used here, you can, if applicable, modify the code in this document or let me know and I can hopefully help out in setting up a new analysis with the requirements you might have.
Sample | Treatment | Replicate | Number of sequenced fragments |
---|---|---|---|
TQ01 | DMSO_mock | R01 | 19525079 |
TQ02 | IAA_50µM | R01 | 19746564 |
TQ03 | 45H02_50µM | R01 | 18675266 |
TQ04 | 60D09_50µM | R01 | 20686795 |
TQ05 | Hit45_50µM | R01 | 17418684 |
TQ06 | DMSO_mock | R02 | 20114660 |
TQ07 | IAA_50µM | R02 | 19923169 |
TQ08 | 45H02_50µM | R02 | 18943350 |
TQ09 | 60D09_50µM | R02 | 18224871 |
TQ10 | Hit45_50µM | R02 | 20158572 |
TQ11 | DMSO_mock | R03 | 21896845 |
TQ12 | IAA_50µM | R03 | 21001968 |
TQ13 | 45H02_50µM | R03 | 19603460 |
TQ14 | 60D09_50µM | R03 | 19419595 |
TQ15 | Hit45_50µM | R03 | 19402201 |
The pair-end reads were mapped using the short read aligner Subreadalign. The analysis were done from within R with the code below. To convert read data to counts per gene eg. gene expreession the function featureCounts from the the same package was used. All these results were then saved as ‘Analysis.data’ that I shared with you via Uppsala University Dropbox service.
library(limma) # Loaded already here for the read targets function
library(Rsubread)
targets <- readTargets('../References/Targets.txt') # Reads in Target file that contains all information for input/output for the short read alignment step. The function is part of the Limma package
buildindex(basename = "TAIR10", reference = '../References/TAIR10_chr_all.fasta') # builds index of genome fasta reference
align(index = 'TAIR10', readfile1 = targets$Read1, readfile2 = targets$Read2, input_format = 'gzFASTQ', output_format = 'BAM', output_file = targets$OutputFile, tieBreakHamming=TRUE, unique=TRUE, indels=5, nthreads = 16) # map reads allow for indels up to 5bp
fc <- featureCounts(files = targets$OutputFile, annot.ext = "/proj/b2014273/private/BILS/References/TAIR4.gtf",isGTFAnnotationFile = TRUE, isPairedEnd = TRUE, strandSpecific = 2, requireBothEndsMapped = FALSE, GTF.featureType = "exon", GTF.attrType = 'gene_id') # count the reads mapping to exons and summarize over genes
The analysis of differential gene expression were done taking both treatment and experiment into account using Limma. Doing this means that treatments are compared to DMSO within replicates and the end results is hence the expcted difference over all samples controlling for any difference there might be in ‘base’-levels between them.
Before the analysis I omit all genes that have fewer than 2 counts per million mapped reads in the majority of libraries. This steps reduce the number of genes from 40745 to 13278. One can retain a larger fraction of genes by altering the filtering, but it is unlikeley to alter the number of significant genes as the variance at low levels of gene expression. In addition the counts data are converted using the ‘voom’ function that: “Transform count data to log2-counts per million (logCPM), estimate the mean-variance relationship and use this to compute appropriate observational-level weights. The data are then ready for linear modelling.”
library(limma)
library(edgeR)
setwd('~/Dropbox/BilsWork/1699/R-analysis/')
load('Analysis.rdata') # Read in the data
rm(list = c('y','design','TQ.dge.list','TQ.dge.list.rpkm','treat','isexpr','targets','fit')) # make sure that the name space are not containing old objects with same naming
TQ.dge.list <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")]) # Convert counts to dge object
TQ.dge.list <- calcNormFactors(TQ.dge.list) # Take library sizes into accaount
# isexpr <- rowSums(cpm(TQ.dge.list) > 5) >= 3 # Create filter genes that have less than 5 counts per willion in at least three libraries
isexpr <- rowSums(cpm(TQ.dge.list) > 2) >= 3
TQ.dge.list <- TQ.dge.list[isexpr,] # Do the actual filtering
TQ.dge.list.rpkm <- rpkm(TQ.dge.list,TQ.dge.list$genes$Length) # create object with rpkm values estimated from the counts
write.table(file = 'CountValues.txt', TQ.dge.list$counts) # Save raw count data
write.table(file = 'FPKMValues.txt', TQ.dge.list.rpkm) # Save FPKM values
targets <- readTargets('../Reference/Targets.txt', sep = ' ') # Read in info on samples and experimental design
treat <- factor(targets$Treatment) # Treatment as factor in stat model
treat <- relevel(treat, ref = 'DMSO_mock') # Set DMSO as the 'Reference' sample
rep <- factor(targets$Replicate) # Replicates in experiment as factor in model
design <- model.matrix(~rep + treat) # create the model matrix. Expression of a gene is the effect of rep + treat, but only comparisons within Rep. is performed
y <- voom(TQ.dge.list, design)
First we just check the structure of the data looking at the correlations among samples and graphically show similarity among samples with a MDS plot.
cor(y$E)
## TQ01.bam TQ02.bam TQ03.bam TQ04.bam TQ05.bam TQ06.bam
## TQ01.bam 1.0000000 0.9855419 0.9894149 0.9827295 0.9841638 0.9915709
## TQ02.bam 0.9855419 1.0000000 0.9845384 0.9747228 0.9785515 0.9882332
## TQ03.bam 0.9894149 0.9845384 1.0000000 0.9904173 0.9916832 0.9873679
## TQ04.bam 0.9827295 0.9747228 0.9904173 1.0000000 0.9926280 0.9784258
## TQ05.bam 0.9841638 0.9785515 0.9916832 0.9926280 1.0000000 0.9806718
## TQ06.bam 0.9915709 0.9882332 0.9873679 0.9784258 0.9806718 1.0000000
## TQ07.bam 0.9864968 0.9924141 0.9834750 0.9741418 0.9774595 0.9904417
## TQ08.bam 0.9881081 0.9795141 0.9910393 0.9884791 0.9881689 0.9869291
## TQ09.bam 0.9816124 0.9763871 0.9895751 0.9927248 0.9913705 0.9818794
## TQ10.bam 0.9849978 0.9802061 0.9916231 0.9910982 0.9924564 0.9853379
## TQ11.bam 0.9656195 0.9631116 0.9683109 0.9614304 0.9634271 0.9603441
## TQ12.bam 0.9520957 0.9621356 0.9577372 0.9503379 0.9553286 0.9486540
## TQ13.bam 0.9517857 0.9507377 0.9630829 0.9593244 0.9615792 0.9451283
## TQ14.bam 0.9397832 0.9362374 0.9560337 0.9620645 0.9604079 0.9308195
## TQ15.bam 0.9456686 0.9464516 0.9599721 0.9614172 0.9635207 0.9385996
## TQ07.bam TQ08.bam TQ09.bam TQ10.bam TQ11.bam TQ12.bam
## TQ01.bam 0.9864968 0.9881081 0.9816124 0.9849978 0.9656195 0.9520957
## TQ02.bam 0.9924141 0.9795141 0.9763871 0.9802061 0.9631116 0.9621356
## TQ03.bam 0.9834750 0.9910393 0.9895751 0.9916231 0.9683109 0.9577372
## TQ04.bam 0.9741418 0.9884791 0.9927248 0.9910982 0.9614304 0.9503379
## TQ05.bam 0.9774595 0.9881689 0.9913705 0.9924564 0.9634271 0.9553286
## TQ06.bam 0.9904417 0.9869291 0.9818794 0.9853379 0.9603441 0.9486540
## TQ07.bam 1.0000000 0.9824103 0.9773864 0.9814544 0.9621701 0.9595839
## TQ08.bam 0.9824103 1.0000000 0.9895693 0.9904173 0.9576352 0.9454395
## TQ09.bam 0.9773864 0.9895693 1.0000000 0.9932859 0.9564619 0.9470988
## TQ10.bam 0.9814544 0.9904173 0.9932859 1.0000000 0.9606972 0.9514290
## TQ11.bam 0.9621701 0.9576352 0.9564619 0.9606972 1.0000000 0.9862283
## TQ12.bam 0.9595839 0.9454395 0.9470988 0.9514290 0.9862283 1.0000000
## TQ13.bam 0.9490899 0.9533659 0.9545318 0.9562291 0.9869687 0.9859408
## TQ14.bam 0.9335456 0.9444968 0.9552341 0.9533477 0.9769780 0.9744433
## TQ15.bam 0.9435281 0.9478007 0.9560347 0.9571586 0.9837041 0.9836156
## TQ13.bam TQ14.bam TQ15.bam
## TQ01.bam 0.9517857 0.9397832 0.9456686
## TQ02.bam 0.9507377 0.9362374 0.9464516
## TQ03.bam 0.9630829 0.9560337 0.9599721
## TQ04.bam 0.9593244 0.9620645 0.9614172
## TQ05.bam 0.9615792 0.9604079 0.9635207
## TQ06.bam 0.9451283 0.9308195 0.9385996
## TQ07.bam 0.9490899 0.9335456 0.9435281
## TQ08.bam 0.9533659 0.9444968 0.9478007
## TQ09.bam 0.9545318 0.9552341 0.9560347
## TQ10.bam 0.9562291 0.9533477 0.9571586
## TQ11.bam 0.9869687 0.9769780 0.9837041
## TQ12.bam 0.9859408 0.9744433 0.9836156
## TQ13.bam 1.0000000 0.9866800 0.9904671
## TQ14.bam 0.9866800 1.0000000 0.9917587
## TQ15.bam 0.9904671 0.9917587 1.0000000
plotMDS(y)
fit <- lmFit(y,design) # fit the model
fit <- eBayes(fit)
The step below is just to validate that taking experiment into the modelling is correct.
#######################################
# This fits a model that do not extract difference within experiments
# It clearly shows that one gain power to detect differentially expressed genes by taking
# replicate into the model
# NB! This is not part of the analysis results.
des <- model.matrix(~treat)
y1 <- voom(TQ.dge.list, des)
fit1 <- lmFit(y1,des) # fit the model
fit1 <- eBayes(fit1)
summary(decideTests(fit1[,2:5]))
## treatA45H02_50 treatA60D09_50 treatHit45_50 treatIAA_50
## -1 0 22 5 16
## 0 13230 12940 13067 13236
## 1 43 311 201 21
########################################
results <- decideTests(fit[,4:7]) # Summarize the number of significant genes, p < 0.05 after BH-correction
vennDiagram(results, circle.col=c("red","green","blue","yellow"), main = 'All significant below adjusted P-value 0.05') # Venn diagram that shows the sets of sign. genes that are common among the contrasts
vennDiagram(results, include = 'down', circle.col=c("red","green","blue","yellow"), main = 'Down-regulated below adjusted P-value 0.05')
vennDiagram(results, include = 'up', circle.col=c("red","green","blue","yellow"), main = 'Up-regulated below adjusted P-value 0.05')
volcanoplot(fit[,4], highlight = 10, names = fit$genes$GeneID, main = '45H02_50' )
volcanoplot(fit[,5], highlight = 10, names = fit$genes$GeneID, main = '60D09_50' )
volcanoplot(fit[,6], highlight = 10, names = fit$genes$GeneID, main = 'Hit45_50' )
volcanoplot(fit[,7], highlight = 10, names = fit$genes$GeneID, main = 'IAA_50' )
write.fit(fit[,4:7], results = results, file = 'LimmaResults.txt' , digits = 30, sep = '\t', adjust = 'BH') # save results
sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.8.6 limma_3.22.7
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 evaluate_0.6 formatR_1.1 htmltools_0.2.6
## [5] knitr_1.9 rmarkdown_0.5.1 stringr_0.6.2 tools_3.1.3
## [9] yaml_2.1.13