Drosophila malanogaster gene expression (ATRO and BKS)

This document describes the analysis done for the Drosophila melanogaster RNA-seq data.

Data

Nine libraries of RNA were sequenced using protontorrent technologies. Single end reads with the majority of reads around 90bp. Below are some basic information from the sequencing. The column with “Number of reads” is the raw output from the machine, whereas the column name “Useful reads” are the number of reads mapping to annotated genes in the genome and hence useful for estimating level of gene expression.

Lib Name	Treatment	Replicate	Number of reads	Useful reads
up-68-1	GFP	S1	42672687	14850179
up-68-2	GFP	S2	24426011	8132221
up-68-7	GFP	S3	51585086	15910151
up-68-3	ATRO	S1	47014971	15878464
up-68-4	ATRO	S2	44623423	15213371
up-68-8	ATRO	S3	35314570	11656287
up-68-5	BKS	S1	44839436	16353608
up-68-6	BKS	S2	41291032	14108910
up-68-9	BKS	S3	90130837	28221185

Quality control

The average quality or reads looks okay, but the mean level is slightly lower than commonly seen for Illumina reads. I think this expected with this technology as this is seen in all projects that I have been involved in with proton data.

up-68-1/GFP up-68-2/GFP up-68-7/GFP up-68-3/ATRO up-68-4/ATRO up-68-8/ATRO up-68-5/BKS up-68-6/BKS up-68-9/BKS

Given this pattern there is no need to filter reads based on quality all reads were used as input when mapping reads to the reference genome.

Mapping

The program Star with the following settings.

genome=/proj/b2014133/nobackup/private/BILS/Dme72.fasta
gtf=/proj/b2014133/nobackup/private/BILS/Dme72.gtf
dir=/proj/b2014133/nobackup/private/BILS/StarGenome
Mapping=/proj/b2014133/nobackup/private/BILS/Mapping2

cd /proj/b2014133/nobackup/private/BILS/

for file in *.fq
    do
    star --readFilesIn ${file} --genomeDir ${dir} --outFileNamePrefix /proj/b2014133/nobackup/private/BILS/Mapping2/${file} --outSAMtype BAM SortedByCoordinate --runThreadN 16 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4 --outReadsUnmapped unmapped_${file}
    done

The unampped reads were then fed into the program bowtie2 for another attempt at mapping the reads to the genome.

cd /pica/v7/b2014133_nobackup/private/BILS/Mapping2/bt
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U ATRO-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - ATRO-S3_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U GFP-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - GFP-S3_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S1.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S1_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S2.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S2_unmapped_remapBowtie2
bowtie2 --local --very-sensitive-local -p 16 -q --mm -x dme72 -U BKS-S3.fqUnmapped.out.mate1 | samtools view -bS - | samtools sort - BKS-S3_unmapped_remapBowtie2

The final file is then created by merging the results of both mappers into the final file using samtools

samtools merge ATRO-S1_all.bam ATRO-S1_unmapped_remapBowtie2.bam ../ATRO-S1.fqAligned.sortedByCoord.out.bam
samtools merge ATRO-S2_all.bam ATRO-S2_unmapped_remapBowtie2.bam ../ATRO-S2.fqAligned.sortedByCoord.out.bam
samtools merge ATRO-S3_all.bam ATRO-S3_unmapped_remapBowtie2.bam ../ATRO-S3.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S1_all.bam GFP-S1_unmapped_remapBowtie2.bam ../GFP-S1.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S2_all.bam GFP-S2_unmapped_remapBowtie2.bam ../GFP-S2.fqAligned.sortedByCoord.out.bam
samtools merge GFP-S3_all.bam GFP-S3_unmapped_remapBowtie2.bam ../GFP-S3.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S1_all.bam BKS-S1_unmapped_remapBowtie2.bam ../BKS-S1.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S2_all.bam BKS-S2_unmapped_remapBowtie2.bam ../BKS-S2.fqAligned.sortedByCoord.out.bam
samtools merge BKS-S3_all.bam BKS-S3_unmapped_remapBowtie2.bam ../BKS-S3.fqAligned.sortedByCoord.out.bam

Gene expression

Using the supplied annotation file for D. melanogaster the mapped reads were converted to counts per gene eg. the number of reads mapping in areas of the genome that is annotated as genes is counted.

library(rsubread)
library(limma)
setwd('~/Dropbox/BilsWork/1012/R-analysis') # this should be set to setwd("/home/thomask/Dme1012/Mapping2/bt") on uppmax
targets <- readTargets('Targets.txt') # Read in sample information

fc <- featureCounts(files=targets$File, annot.ext = "../../Dme72.gtf", isGTFAnnotationFile = TRUE, isPairedEnd = FALSE, strandSpecific = 1, GTF.featureType="exon", GTF.attrType='gene_id', nthreads = 16)

In order to identify differential gene expression we convert the count data frame above to a DGE list that is a well-suited object for the Limma package that uses linear modelling to identify genes that are differentially expressed between the treatments. For the modelling to work ane need to supply a matrix of sample information. In addition I exclude genes with a very low average expression (at low expression noise in sampling will be the dominating signal). As the replicates are not that similar I have done two kind of analysis, one ‘classic’ named design below that compares the replicates as true randow replicates and one named design2 that ‘blocks’ replicates in the analysis and hence looks at differences between GFP and treatments within each replicate.

library(limma)
library(edgeR)

load('Analysis.rdata') # load the precomputed objects from above
rm(list = c('y','targets','design', 'TQ.dge.list', 'TQ.dge.list.rpkm', 'fit', 'isexpr', 'treat')) # remove objects that needs to bu re-computed
targets <- readTargets('Targets.txt')

treat <- factor(targets$Treatment) 
treat <- relevel(treat, 'GFP') # make GFP the reference sample

replicate <- as.factor(targets$Replicate)

design <- model.matrix(~treat) # treat all replicates as independent and assume that comparing GFP-S1 to ATRO-S1 is the same as comparing the GFP-S1 to any other ATRO replicate

design2 <- model.matrix(~replicate + treat) # Blocks the replicate effect and hence extracts differences within replicated experiment before estimating the effect of the treatment (hence assuming that the different replicates might have different 'base' expression levels, but the change in each of them should represont the interesting genes.)

dge.list <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
isexpr <- rowSums(cpm(dge.list) > 2) >= 3 # Create filter that contains list of genes that have at least 2 counts per million mapped reads in at least 3 samples  
dge.list <- dge.list[isexpr,,keep.lib.size=FALSE] # perform the actual filtering described above
dge.list <- calcNormFactors(dge.list)

dge.list.rpkm <- rpkm(dge.list, dge.list$genes$Length)
dge.list.cpm <- cpm(dge.list)


Expression.table <- cbind(dge.list.rpkm, dge.list$counts)
write.table(file = 'Expressiontable.txt', sep = '\t', Expression.table)

library(biomaRt)

## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'

ensembl <- useMart('ensembl')
# listDatasets(ensembl)
ensembl <- useDataset('dmelanogaster_gene_ensembl', mart = ensembl)
# filters = listFilters(ensembl)
ids = getBM(attributes=c('flybase_gene_id','flybase_translation_id'), filters='flybase_gene_id', values=row.names(fc$counts), mart=ensembl)

read.table('~/Downloads/test4', stringsAsFactors = FALSE) -> ncrna


dge.list.filt1 <- dge.list[row.names(dge.list) %in% unique(ids[,1]),]
dge.list.filt1 <- calcNormFactors(dge.list.filt1)

dge.list.filt2 <- dge.list[!row.names(dge.list) %in% ncrna[,1],]
dge.list.filt2 <- calcNormFactors(dge.list.filt2)

y <- voom(dge.list, design, plot = T)

fit <- lmFit(y, design)
fit <- eBayes(fit)
summary(results.fit <- decideTests(fit[,2:3]))

##    treatATRO treatBKS
## -1         3        4
## 0       8282     8281
## 1          1        1

yfilt1 <- voom(dge.list.filt1, design, plot = T)

fitfilt1 <- lmFit(yfilt1, design)
fitfilt1 <- eBayes(fitfilt1)
summary(results.fitfilt1 <- decideTests(fitfilt1[,2:3]))

##    treatATRO treatBKS
## -1         3        4
## 0       7657     7656
## 1          1        1

yfilt2 <- voom(dge.list.filt2, design, plot = T)

fitfilt2 <- lmFit(yfilt2, design)
fitfilt2 <- eBayes(fitfilt2)
summary(results.fitfilt2 <- decideTests(fitfilt2[,2:3]))

##    treatATRO treatBKS
## -1         3        4
## 0       8191     8190
## 1          1        1

topTable(fit, coef = 'treatATRO')

##                  GeneID Length      logFC    AveExpr          t
## FBgn0010825 FBgn0010825   8819  5.3803675 9.00736003  12.442682
## FBgn0010194 FBgn0010194   3848 -1.5212181 5.57156479  -7.942941
## FBgn0031589 FBgn0031589   3023 -1.6497696 5.95465443  -6.373467
## FBgn0019643 FBgn0019643   1748 -3.3304140 0.75117800  -9.271044
## FBgn0086446 FBgn0086446    708  0.9268162 4.44266444   5.063730
## FBgn0035262 FBgn0035262   2538 -1.3027296 2.65139897  -5.327082
## FBgn0013679 FBgn0013679    939  1.4907238 5.60141688   4.765095
## FBgn0262952 FBgn0262952   1339  0.8492352 7.32463261   4.376335
## FBgn0031723 FBgn0031723   1209 -9.1289682 0.06589965 -11.104883
## FBgn0035581 FBgn0035581  14158 -2.6211503 2.22909231  -4.482497
##                  P.Value   adj.P.Val           B
## FBgn0010825 9.352629e-07 0.007749588  5.08048538
## FBgn0010194 3.225547e-05 0.053453762  2.51458130
## FBgn0031589 1.644119e-04 0.194616681  1.24921039
## FBgn0019643 9.786705e-06 0.020273159  0.04778987
## FBgn0086446 8.030701e-04 0.604930767 -0.26634190
## FBgn0035262 5.729744e-04 0.527518465 -0.36820512
## FBgn0013679 1.191887e-03 0.759690753 -0.51253817
## FBgn0262952 2.032020e-03 0.997759223 -1.00684401
## FBgn0031723 2.341431e-06 0.009700550 -1.10769994
## FBgn0035581 1.752682e-03 0.968181269 -1.29909220

topTable(fit, coef = 'treatBKS')

##                  GeneID Length     logFC     AveExpr          t
## FBgn0010575 FBgn0010575  11424  5.826218  8.84518097  10.542009
## FBgn0010194 FBgn0010194   3848 -1.398472  5.57156479  -7.664048
## FBgn0035262 FBgn0035262   2538 -1.627158  2.65139897  -6.720691
## FBgn0031589 FBgn0031589   3023 -1.430262  5.95465443  -5.759159
## FBgn0019643 FBgn0019643   1748 -4.165845  0.75117800 -10.652022
## FBgn0031723 FBgn0031723   1209 -8.279527  0.06589965 -10.441185
## FBgn0035581 FBgn0035581  14158 -2.899911  2.22909231  -5.176272
## FBgn0051439 FBgn0051439   2796 -6.870977  0.95941525  -6.419388
## FBgn0028371 FBgn0028371  10443 -2.132522  2.50733518  -4.667289
## FBgn0044811 FBgn0044811    506 -5.493122 -2.09615767  -9.514780
##                  P.Value  adj.P.Val          B
## FBgn0010575 3.548532e-06 0.01058044  4.5531591
## FBgn0010194 4.227567e-05 0.05838271  2.5384719
## FBgn0035262 1.120128e-04 0.13259117  1.0160490
## FBgn0031589 3.362387e-04 0.30956372  0.6815035
## FBgn0019643 3.266656e-06 0.01058044  0.6324550
## FBgn0031723 3.830717e-06 0.01058044 -0.2930157
## FBgn0035581 6.943445e-04 0.47944486 -0.3908606
## FBgn0051439 1.561465e-04 0.16172871 -0.6224687
## FBgn0028371 1.360255e-03 0.66300447 -0.8201900
## FBgn0044811 7.985720e-06 0.01654242 -0.9053319

write.fit(fit[,2:3], results.fit, digits = 20, adjust = 'BH', sep = '\t', file = 'ResFit.txt')


y2 <- voom(dge.list, design2, plot =T)

write.table(file = 'LimmaExpression.txt', y2$E, sep = '\t')
fit2 <- lmFit(y2, design2)
fit2 <- eBayes(fit2)
summary(results.fit2 <- decideTests(fit2[,4:5]))

##    treatATRO treatBKS
## -1        54       55
## 0       8200     8210
## 1         32       21

topTable(fit2, coef = 'treatATRO')

##                  GeneID Length     logFC    AveExpr          t
## FBgn0010825 FBgn0010825   8819  5.374696 9.00736003  40.707225
## FBgn0085813 FBgn0085813   6026  1.337022 6.97840900  12.850079
## FBgn0010194 FBgn0010194   3848 -1.512937 5.57156479 -11.710172
## FBgn0031723 FBgn0031723   1209 -9.136291 0.06589965 -15.851791
## FBgn0015400 FBgn0015400   4968 -2.168927 3.37956156 -11.342622
## FBgn0037544 FBgn0037544    809 -4.017437 3.33812047 -10.764037
## FBgn0026175 FBgn0026175    477 -4.969969 1.10476479 -10.529547
## FBgn0051439 FBgn0051439   2796 -6.659783 0.95941525 -10.670170
## FBgn0031858 FBgn0031858   3254 -6.136463 0.87093610 -10.155240
## FBgn0020257 FBgn0020257   3449  1.453116 6.29818753   9.477579
##                  P.Value    adj.P.Val         B
## FBgn0010825 1.030834e-13 8.541492e-10 21.703614
## FBgn0085813 3.886748e-08 1.073520e-04  9.140583
## FBgn0010194 1.046713e-07 2.168266e-04  8.201421
## FBgn0031723 3.987378e-09 1.651971e-05  8.026303
## FBgn0015400 1.466489e-07 2.430265e-04  8.011538
## FBgn0037544 2.542050e-07 3.297623e-04  7.468473
## FBgn0026175 3.199499e-07 3.313881e-04  6.624163
## FBgn0051439 2.785826e-07 3.297623e-04  6.343715
## FBgn0031858 4.660410e-07 4.290684e-04  6.088530
## FBgn0020257 9.484826e-07 7.144661e-04  5.834829

topTable(fit2, coef = 'treatBKS')

##                  GeneID Length     logFC    AveExpr          t
## FBgn0010575 FBgn0010575  11424  5.787670 8.84518097  25.409049
## FBgn0015400 FBgn0015400   4968 -2.246005 3.37956156 -11.729574
## FBgn0010194 FBgn0010194   3848 -1.388819 5.57156479 -11.776125
## FBgn0031723 FBgn0031723   1209 -8.306857 0.06589965 -14.549444
## FBgn0026175 FBgn0026175    477 -4.642813 1.10476479 -11.051985
## FBgn0037544 FBgn0037544    809 -4.154882 3.33812047 -10.644507
## FBgn0051439 FBgn0051439   2796 -6.842100 0.95941525 -11.359956
## FBgn0031858 FBgn0031858   3254 -6.056058 0.87093610 -10.740447
## FBgn0037324 FBgn0037324   1944 -5.115733 1.12428977  -9.369197
## FBgn0035581 FBgn0035581  14158 -2.957630 2.22909231  -8.966184
##                  P.Value    adj.P.Val         B
## FBgn0010575 2.109896e-11 1.748260e-07 16.798360
## FBgn0015400 1.028505e-07 2.130548e-04  8.353366
## FBgn0010194 9.862022e-08 2.130548e-04  8.149541
## FBgn0031723 1.016146e-08 4.209891e-05  8.116163
## FBgn0026175 1.927389e-07 2.661724e-04  7.397630
## FBgn0037544 2.856787e-07 2.958918e-04  7.358103
## FBgn0051439 1.443051e-07 2.391424e-04  7.150678
## FBgn0031858 2.601066e-07 2.958918e-04  6.879397
## FBgn0037324 1.066624e-06 8.838045e-04  5.860126
## FBgn0035581 1.666254e-06 1.150549e-03  5.624229

write.fit(fit2[,4:5], results = results.fit2, digits = 20, adjust = 'BH', sep = '\t', file = 'ResFit2.txt')
volcanoplot(fit2[,4], highlight = 10, names = substr(fit2$genes$GeneID, 6, 100), main = 'ATRO vs GFP')

volcanoplot(fit2[,5], highlight = 10, names = substr(fit2$genes$GeneID, 6, 100), main = 'BKS vs GFP')

vennDiagram(results.fit2, circle.col=c("blue","red"), main = "Venndiagram of sign genes")

y2filt1 <- voom(dge.list.filt1, design2, plot = F)
fit2filt1 <- lmFit(y2filt1, design2)
fit2filt1 <- eBayes(fit2filt1)
summary(results.fit2filt1 <- decideTests(fit2filt1[,4:5]))

##    treatATRO treatBKS
## -1        56       55
## 0       7583     7591
## 1         22       15

y2filt2 <- voom(dge.list.filt2, design2, plot = F)
fit2filt2 <- lmFit(y2filt2, design2)
fit2filt2 <- eBayes(fit2filt2)
summary(results.fit2 <- decideTests(fit2filt2[,4:5]))

##    treatATRO treatBKS
## -1        54       55
## 0       8113     8123
## 1         28       17

Drosophila malanogaster gene expression (ATRO and BKS)

Thomas Källman, BILS

25 Jun 2015

Data

Quality control

Mapping

Gene expression