This document describes the analysis done for the Drosophila melanogaster RNA-seq data.
Nine libraries of RNA were sequenced using protontorrent technologies. Single end reads with the majority of reads around 90bp. Below are some basic information from the sequencing. The column with “Number of reads” is the raw output from the machine, whereas the column name “Useful reads” are the number of reads mapping to annotated genes in the genome and hence useful for estimating level of gene expression.
Lib Name | Treatment | Replicate | Number of reads | Useful reads |
---|---|---|---|---|
up-68-1 | GFP | S1 | 42672687 | 14850179 |
up-68-2 | GFP | S2 | 24426011 | 8132221 |
up-68-7 | GFP | S3 | 51585086 | 15910151 |
up-68-3 | ATRO | S1 | 47014971 | 15878464 |
up-68-4 | ATRO | S2 | 44623423 | 15213371 |
up-68-8 | ATRO | S3 | 35314570 | 11656287 |
up-68-5 | BKS | S1 | 44839436 | 16353608 |
up-68-6 | BKS | S2 | 41291032 | 14108910 |
up-68-9 | BKS | S3 | 90130837 | 28221185 |
The average quality or reads looks okay, but the mean level is slightly lower than commonly seen for Illumina reads. I think this expected with this technology as this is seen in all projects that I have been involved in with proton data.
up-68-1/GFP
up-68-2/GFP
up-68-7/GFP
up-68-3/ATRO
up-68-4/ATRO
up-68-8/ATRO
up-68-5/BKS
up-68-6/BKS
up-68-9/BKS
Given this pattern there is no need to filter reads based on quality all reads were used as input when mapping reads to the reference genome.
The program Star with the following settings.
genome=/proj/b2014133/nobackup/private/BILS/Dme72.fasta
gtf=/proj/b2014133/nobackup/private/BILS/Dme72.gtf
dir=/proj/b2014133/nobackup/private/BILS/StarGenome
Mapping=/proj/b2014133/nobackup/private/BILS/Mapping2
cd /proj/b2014133/nobackup/private/BILS/
for file in *.fq
do
star --readFilesIn ${file} --genomeDir ${dir} --outFileNamePrefix /proj/b2014133/nobackup/private/BILS/Mapping2/${file} --outSAMtype BAM SortedByCoordinate --runThreadN 16 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --outReadsUnmapped unmapped_${file}
done
The unampped reads were then fed into the program bowtie2 for another attempt at mapping the reads to the genome.
cd /pica/v7/b2014133_nobackup/private/BILS/Mapping2/bt
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S3_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S3_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S3_unmapped_remapBowtie2
The final file is then created by merging the results of both mappers into the final file using samtools
samtools merge ATRO-S1_all.bam ATRO-S1_unmapped_remapBowtie2.bam ../ATRO-S1.fqAligned.sortedByCoord.out.bam
samtools merge ATRO-S2_all.bam ATRO-S2_unmapped_remapBowtie2.bam ../ATRO-S2.fqAligned.sortedByCoord.out.bam
samtools merge ATRO-S3_all.bam ATRO-S3_unmapped_remapBowtie2.bam ../ATRO-S3.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S1_all.bam GFP-S1_unmapped_remapBowtie2.bam ../GFP-S1.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S2_all.bam GFP-S2_unmapped_remapBowtie2.bam ../GFP-S2.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S3_all.bam GFP-S3_unmapped_remapBowtie2.bam ../GFP-S3.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S1_all.bam BKS-S1_unmapped_remapBowtie2.bam ../BKS-S1.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S2_all.bam BKS-S2_unmapped_remapBowtie2.bam ../BKS-S2.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S3_all.bam BKS-S3_unmapped_remapBowtie2.bam ../BKS-S3.fqAligned.sortedByCoord.out.bam
Using the supplied annotation file for D. melanogaster the mapped reads were converted to counts per gene eg. the number of reads mapping in areas of the genome that is annotated as genes is counted.
library(rsubread)
library(limma)
setwd('~/Dropbox/BilsWork/1012/R-analysis') # this should be set to setwd("/home/thomask/Dme1012/Mapping2/bt") on uppmax
targets <- readTargets('Targets.txt') # Read in sample information
fc <- featureCounts(files=targets$File, annot.ext = "../../Dme72.gtf", isGTFAnnotationFile = TRUE, isPairedEnd = FALSE, strandSpecific = 1, GTF.featureType="exon", GTF.attrType='gene_id', nthreads = 16)
In order to identify differential gene expression we convert the count data frame above to a DGE list that is a well-suited object for the Limma package that uses linear modelling to identify genes that are differentially expressed between the treatments. For the modelling to work ane need to supply a matrix of sample information. In addition I exclude genes with a very low average expression (at low expression noise in sampling will be the dominating signal). As the replicates are not that similar I have done two kind of analysis, one ‘classic’ named design below that compares the replicates as true randow replicates and one named design2 that ‘blocks’ replicates in the analysis and hence looks at differences between GFP and treatments within each replicate.
library(limma)
library(edgeR)
load('Analysis.rdata') # load the precomputed objects from above
rm(list = c('y','targets','design', 'TQ.dge.list', 'TQ.dge.list.rpkm', 'fit', 'isexpr', 'treat')) # remove objects that needs to bu re-computed
targets <- readTargets('Targets.txt')
treat <- factor(targets$Treatment)
treat <- relevel(treat, 'GFP') # make GFP the reference sample
replicate <- as.factor(targets$Replicate)
design <- model.matrix(~treat) # treat all replicates as independent and assume that comparing GFP-S1 to ATRO-S1 is the same as comparing the GFP-S1 to any other ATRO replicate
design2 <- model.matrix(~replicate + treat) # Blocks the replicate effect and hence extracts differences within replicated experiment before estimating the effect of the treatment (hence assuming that the different replicates might have different 'base' expression levels, but the change in each of them should represont the interesting genes.)
dge.list <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
isexpr <- rowSums(cpm(dge.list) > 2) >= 3 # Create filter that contains list of genes that have at least 2 counts per million mapped reads in at least 3 samples
dge.list <- dge.list[isexpr,,keep.lib.size=FALSE] # perform the actual filtering described above
dge.list <- calcNormFactors(dge.list)
dge.list.rpkm <- rpkm(dge.list, dge.list$genes$Length)
dge.list.cpm <- cpm(dge.list)
Expression.table <- cbind(dge.list.rpkm, dge.list$counts)
write.table(file = 'Expressiontable.txt', sep = '\t', Expression.table)
library(biomaRt)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
ensembl <- useMart('ensembl')
# listDatasets(ensembl)
ensembl <- useDataset('dmelanogaster_gene_ensembl', mart = ensembl)
# filters = listFilters(ensembl)
ids = getBM(attributes=c('flybase_gene_id','flybase_translation_id'), filters='flybase_gene_id', values=row.names(fc$counts), mart=ensembl)
read.table('~/Downloads/test4', stringsAsFactors = FALSE) -> ncrna
dge.list.filt1 <- dge.list[row.names(dge.list) %in% unique(ids[,1]),]
dge.list.filt1 <- calcNormFactors(dge.list.filt1)
dge.list.filt2 <- dge.list[!row.names(dge.list) %in% ncrna[,1],]
dge.list.filt2 <- calcNormFactors(dge.list.filt2)
y <- voom(dge.list, design, plot = T)
fit <- lmFit(y, design)
fit <- eBayes(fit)
summary(results.fit <- decideTests(fit[,2:3]))
## treatATRO treatBKS
## -1 3 4
## 0 8282 8281
## 1 1 1
yfilt1 <- voom(dge.list.filt1, design, plot = T)
fitfilt1 <- lmFit(yfilt1, design)
fitfilt1 <- eBayes(fitfilt1)
summary(results.fitfilt1 <- decideTests(fitfilt1[,2:3]))
## treatATRO treatBKS
## -1 3 4
## 0 7657 7656
## 1 1 1
yfilt2 <- voom(dge.list.filt2, design, plot = T)
fitfilt2 <- lmFit(yfilt2, design)
fitfilt2 <- eBayes(fitfilt2)
summary(results.fitfilt2 <- decideTests(fitfilt2[,2:3]))
## treatATRO treatBKS
## -1 3 4
## 0 8191 8190
## 1 1 1
topTable(fit, coef = 'treatATRO')
## GeneID Length logFC AveExpr t
## FBgn0010825 FBgn0010825 8819 5.3803675 9.00736003 12.442682
## FBgn0010194 FBgn0010194 3848 -1.5212181 5.57156479 -7.942941
## FBgn0031589 FBgn0031589 3023 -1.6497696 5.95465443 -6.373467
## FBgn0019643 FBgn0019643 1748 -3.3304140 0.75117800 -9.271044
## FBgn0086446 FBgn0086446 708 0.9268162 4.44266444 5.063730
## FBgn0035262 FBgn0035262 2538 -1.3027296 2.65139897 -5.327082
## FBgn0013679 FBgn0013679 939 1.4907238 5.60141688 4.765095
## FBgn0262952 FBgn0262952 1339 0.8492352 7.32463261 4.376335
## FBgn0031723 FBgn0031723 1209 -9.1289682 0.06589965 -11.104883
## FBgn0035581 FBgn0035581 14158 -2.6211503 2.22909231 -4.482497
## P.Value adj.P.Val B
## FBgn0010825 9.352629e-07 0.007749588 5.08048538
## FBgn0010194 3.225547e-05 0.053453762 2.51458130
## FBgn0031589 1.644119e-04 0.194616681 1.24921039
## FBgn0019643 9.786705e-06 0.020273159 0.04778987
## FBgn0086446 8.030701e-04 0.604930767 -0.26634190
## FBgn0035262 5.729744e-04 0.527518465 -0.36820512
## FBgn0013679 1.191887e-03 0.759690753 -0.51253817
## FBgn0262952 2.032020e-03 0.997759223 -1.00684401
## FBgn0031723 2.341431e-06 0.009700550 -1.10769994
## FBgn0035581 1.752682e-03 0.968181269 -1.29909220
topTable(fit, coef = 'treatBKS')
## GeneID Length logFC AveExpr t
## FBgn0010575 FBgn0010575 11424 5.826218 8.84518097 10.542009
## FBgn0010194 FBgn0010194 3848 -1.398472 5.57156479 -7.664048
## FBgn0035262 FBgn0035262 2538 -1.627158 2.65139897 -6.720691
## FBgn0031589 FBgn0031589 3023 -1.430262 5.95465443 -5.759159
## FBgn0019643 FBgn0019643 1748 -4.165845 0.75117800 -10.652022
## FBgn0031723 FBgn0031723 1209 -8.279527 0.06589965 -10.441185
## FBgn0035581 FBgn0035581 14158 -2.899911 2.22909231 -5.176272
## FBgn0051439 FBgn0051439 2796 -6.870977 0.95941525 -6.419388
## FBgn0028371 FBgn0028371 10443 -2.132522 2.50733518 -4.667289
## FBgn0044811 FBgn0044811 506 -5.493122 -2.09615767 -9.514780
## P.Value adj.P.Val B
## FBgn0010575 3.548532e-06 0.01058044 4.5531591
## FBgn0010194 4.227567e-05 0.05838271 2.5384719
## FBgn0035262 1.120128e-04 0.13259117 1.0160490
## FBgn0031589 3.362387e-04 0.30956372 0.6815035
## FBgn0019643 3.266656e-06 0.01058044 0.6324550
## FBgn0031723 3.830717e-06 0.01058044 -0.2930157
## FBgn0035581 6.943445e-04 0.47944486 -0.3908606
## FBgn0051439 1.561465e-04 0.16172871 -0.6224687
## FBgn0028371 1.360255e-03 0.66300447 -0.8201900
## FBgn0044811 7.985720e-06 0.01654242 -0.9053319
write.fit(fit[,2:3], results.fit, digits = 20, adjust = 'BH', sep = '\t', file = 'ResFit.txt')
y2 <- voom(dge.list, design2, plot =T)
write.table(file = 'LimmaExpression.txt', y2$E, sep = '\t')
fit2 <- lmFit(y2, design2)
fit2 <- eBayes(fit2)
summary(results.fit2 <- decideTests(fit2[,4:5]))
## treatATRO treatBKS
## -1 54 55
## 0 8200 8210
## 1 32 21
topTable(fit2, coef = 'treatATRO')
## GeneID Length logFC AveExpr t
## FBgn0010825 FBgn0010825 8819 5.374696 9.00736003 40.707225
## FBgn0085813 FBgn0085813 6026 1.337022 6.97840900 12.850079
## FBgn0010194 FBgn0010194 3848 -1.512937 5.57156479 -11.710172
## FBgn0031723 FBgn0031723 1209 -9.136291 0.06589965 -15.851791
## FBgn0015400 FBgn0015400 4968 -2.168927 3.37956156 -11.342622
## FBgn0037544 FBgn0037544 809 -4.017437 3.33812047 -10.764037
## FBgn0026175 FBgn0026175 477 -4.969969 1.10476479 -10.529547
## FBgn0051439 FBgn0051439 2796 -6.659783 0.95941525 -10.670170
## FBgn0031858 FBgn0031858 3254 -6.136463 0.87093610 -10.155240
## FBgn0020257 FBgn0020257 3449 1.453116 6.29818753 9.477579
## P.Value adj.P.Val B
## FBgn0010825 1.030834e-13 8.541492e-10 21.703614
## FBgn0085813 3.886748e-08 1.073520e-04 9.140583
## FBgn0010194 1.046713e-07 2.168266e-04 8.201421
## FBgn0031723 3.987378e-09 1.651971e-05 8.026303
## FBgn0015400 1.466489e-07 2.430265e-04 8.011538
## FBgn0037544 2.542050e-07 3.297623e-04 7.468473
## FBgn0026175 3.199499e-07 3.313881e-04 6.624163
## FBgn0051439 2.785826e-07 3.297623e-04 6.343715
## FBgn0031858 4.660410e-07 4.290684e-04 6.088530
## FBgn0020257 9.484826e-07 7.144661e-04 5.834829
topTable(fit2, coef = 'treatBKS')
## GeneID Length logFC AveExpr t
## FBgn0010575 FBgn0010575 11424 5.787670 8.84518097 25.409049
## FBgn0015400 FBgn0015400 4968 -2.246005 3.37956156 -11.729574
## FBgn0010194 FBgn0010194 3848 -1.388819 5.57156479 -11.776125
## FBgn0031723 FBgn0031723 1209 -8.306857 0.06589965 -14.549444
## FBgn0026175 FBgn0026175 477 -4.642813 1.10476479 -11.051985
## FBgn0037544 FBgn0037544 809 -4.154882 3.33812047 -10.644507
## FBgn0051439 FBgn0051439 2796 -6.842100 0.95941525 -11.359956
## FBgn0031858 FBgn0031858 3254 -6.056058 0.87093610 -10.740447
## FBgn0037324 FBgn0037324 1944 -5.115733 1.12428977 -9.369197
## FBgn0035581 FBgn0035581 14158 -2.957630 2.22909231 -8.966184
## P.Value adj.P.Val B
## FBgn0010575 2.109896e-11 1.748260e-07 16.798360
## FBgn0015400 1.028505e-07 2.130548e-04 8.353366
## FBgn0010194 9.862022e-08 2.130548e-04 8.149541
## FBgn0031723 1.016146e-08 4.209891e-05 8.116163
## FBgn0026175 1.927389e-07 2.661724e-04 7.397630
## FBgn0037544 2.856787e-07 2.958918e-04 7.358103
## FBgn0051439 1.443051e-07 2.391424e-04 7.150678
## FBgn0031858 2.601066e-07 2.958918e-04 6.879397
## FBgn0037324 1.066624e-06 8.838045e-04 5.860126
## FBgn0035581 1.666254e-06 1.150549e-03 5.624229
write.fit(fit2[,4:5], results = results.fit2, digits = 20, adjust = 'BH', sep = '\t', file = 'ResFit2.txt')
volcanoplot(fit2[,4], highlight = 10, names = substr(fit2$genes$GeneID, 6, 100), main = 'ATRO vs GFP')
volcanoplot(fit2[,5], highlight = 10, names = substr(fit2$genes$GeneID, 6, 100), main = 'BKS vs GFP')
vennDiagram(results.fit2, circle.col=c("blue","red"), main = "Venndiagram of sign genes")
y2filt1 <- voom(dge.list.filt1, design2, plot = F)
fit2filt1 <- lmFit(y2filt1, design2)
fit2filt1 <- eBayes(fit2filt1)
summary(results.fit2filt1 <- decideTests(fit2filt1[,4:5]))
## treatATRO treatBKS
## -1 56 55
## 0 7583 7591
## 1 22 15
y2filt2 <- voom(dge.list.filt2, design2, plot = F)
fit2filt2 <- lmFit(y2filt2, design2)
fit2filt2 <- eBayes(fit2filt2)
summary(results.fit2 <- decideTests(fit2filt2[,4:5]))
## treatATRO treatBKS
## -1 54 55
## 0 8113 8123
## 1 28 17