Below is the work-flow I used to analyze gene expression differences in post-mortem human hippocampal brain tissue between non-drug users and chronic cocaine users that overdosed on cocaine. Specifically, I obtained data from the Sequence Read Archive, aligned the data with tophat2 using the most recent version of the human genome (hg38), used HTSeq counts to obtain count data from the RNAseq reads and performed the analysis with DESeq2.

First, I uploaded all the packages necessary for performing this analysis.

## Warning: package 'knitr' was built under R version 3.3.2
## Warning: package 'DESeq2' was built under R version 3.3.2
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.3.3
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.3.3
## Loading required package: GenomicRanges
## Warning: package 'GenomicRanges' was built under R version 3.3.3
## Loading required package: GenomeInfoDb
## Warning: package 'GenomeInfoDb' was built under R version 3.3.2
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.3.2
## Loading required package: annotate
## Warning: package 'annotate' was built under R version 3.3.2
## Loading required package: AnnotationDbi
## Warning: package 'AnnotationDbi' was built under R version 3.3.2
## Loading required package: XML
## Warning: package 'XML' was built under R version 3.3.2
## Warning: package 'matrixStats' was built under R version 3.3.2
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:SummarizedExperiment':
## 
##     rowRanges
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'genefilter'
## The following objects are masked from 'package:matrixStats':
## 
##     rowSds, rowVars
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## The following object is masked from 'package:IRanges':
## 
##     shift
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second
## Loading required package: mgcv
## Warning: package 'mgcv' was built under R version 3.3.2
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 3.3.2
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
## 
##     collapse
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.

Then, I uploaded the files from the directory on my computer & created a count data set

setwd('/Users/shuggett/Desktop/RNAseqCocaine/')
directory <- (getwd())
sampleFiles <- grep('C',list.files(directory), value = T)
sampleCondition <- sub("C",'\\1',sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles,
                          condition = sampleCondition)

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable, directory = directory, design = ~ condition )

These commands will visualize the data set we created & will familarize you with the layout of these R objects we will use in the analysis.

head(assay(ddsHTSeq))
##            C1_count1.gff C1_count2.gff C1_count3.gff C1_count4.gff
## uc001aak.4             0             0             0             0
## uc001aal.1             0             0             0             0
## uc001aaq.3             0             0             0             0
## uc001aar.3             0             0             0             0
## uc001abo.4            20            14             7            14
## uc001abp.3             0             0             0             0
##            C1_count5.gff C1_count6.gff C1_count7.gff C1_count8.gff
## uc001aak.4             0             0             0             0
## uc001aal.1             0             0             0             0
## uc001aaq.3             0             0             0             0
## uc001aar.3             0             0             0             0
## uc001abo.4             9            13            10             5
## uc001abp.3             0             0             0             0
##            C2_count1.gff C2_count2.gff C2_count3.gff C2_count4.gff
## uc001aak.4             0             0             0             0
## uc001aal.1             0             0             0             0
## uc001aaq.3             0             0             0             0
## uc001aar.3             0             0             0             0
## uc001abo.4             3            16            16             3
## uc001abp.3             0             0             0             0
##            C2_count5.gff C2_count6.gff C2_count7.gff C2_count8.gff
## uc001aak.4             0             0             0             0
## uc001aal.1             0             0             0             0
## uc001aaq.3             0             0             0             0
## uc001aar.3             0             0             0             0
## uc001abo.4            17            12            33             6
## uc001abp.3             0             0             0             0
##            C2_count9.gff
## uc001aak.4             0
## uc001aal.1             0
## uc001aaq.3             0
## uc001aar.3             0
## uc001abo.4            22
## uc001abp.3             0
mean(colSums(assay(ddsHTSeq)))
## [1] 1105471
colData(ddsHTSeq)
## DataFrame with 17 rows and 1 column
##                  condition
##                   <factor>
## C1_count1.gff 1_count1.gff
## C1_count2.gff 1_count2.gff
## C1_count3.gff 1_count3.gff
## C1_count4.gff 1_count4.gff
## C1_count5.gff 1_count5.gff
## ...                    ...
## C2_count5.gff 2_count5.gff
## C2_count6.gff 2_count6.gff
## C2_count7.gff 2_count7.gff
## C2_count8.gff 2_count8.gff
## C2_count9.gff 2_count9.gff
mcols(rowData(ddsHTSeq))
## DataFrame with 0 rows and 2 columns

Now we will perform some basic quality control steps. Essentially these commands will select all genes that contain more than one read

GeneCounts <- counts(ddsHTSeq)
idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)})
sum(idx.nz)
## [1] 10576
idx <- which(apply(GeneCounts, 1, function(x) { all(x > 0)}))
RNAnames <- names(idx)
nz.counts <- subset(GeneCounts, idx.nz)
sam <- sample(dim(nz.counts)[1], 5)
nz.counts[sam, ]
##            C1_count1.gff C1_count2.gff C1_count3.gff C1_count4.gff
## uc002lfj.5            26            19            10            29
## uc059emq.1            22             3            11            42
## uc063jcl.1             2             2             6            13
## uc058aia.1            11             1             5            14
## uc001mtc.5            31             6            29            38
##            C1_count5.gff C1_count6.gff C1_count7.gff C1_count8.gff
## uc002lfj.5            17            27            24            15
## uc059emq.1            18            23            14            11
## uc063jcl.1             3             4             3             3
## uc058aia.1             6             6             4             7
## uc001mtc.5            19            15            23            41
##            C2_count1.gff C2_count2.gff C2_count3.gff C2_count4.gff
## uc002lfj.5             5            16            16             1
## uc059emq.1             9            13            15             3
## uc063jcl.1             3             6             7             1
## uc058aia.1             4             3             6             2
## uc001mtc.5             9            14            11             2
##            C2_count5.gff C2_count6.gff C2_count7.gff C2_count8.gff
## uc002lfj.5            27            27            22            13
## uc059emq.1            26            10            16            12
## uc063jcl.1             6            11             8             7
## uc058aia.1            16            13            12             9
## uc001mtc.5            31            23            54            30
##            C2_count9.gff
## uc002lfj.5            21
## uc059emq.1            23
## uc063jcl.1             7
## uc058aia.1            14
## uc001mtc.5            34

The next series of commands will split our data up into cases (cocaine users) and controls (non-drug users) by using a standard ifelse statement. Then we will estimate the size factors using the median ratio method proposed by Anders & Huber, 2010. Counts will be normalized across cases and controls to minimize bias of the analyses. Then, we will visualize the normalization process (using an ECDF graph) to ensure it worked correctly & include a red vertical line on this plot to indicate the mean number of reads for our data set.

## Warning in `[<-.factor`(`*tmp*`, i, value = c("control", "control",
## "control", : invalid factor level, NA generated
## DataFrame with 17 rows and 1 column
##               condition
##                <factor>
## C1_count1.gff   control
## C1_count2.gff   control
## C1_count3.gff   control
## C1_count4.gff   control
## C1_count5.gff   control
## ...                 ...
## C2_count5.gff   cocaine
## C2_count6.gff   cocaine
## C2_count7.gff   cocaine
## C2_count8.gff   cocaine
## C2_count9.gff   cocaine
## C1_count1.gff C1_count2.gff C1_count3.gff C1_count4.gff C1_count5.gff 
##     1.7966296     0.9217811     0.9792610     1.6269046     1.0939893 
## C1_count6.gff C1_count7.gff C1_count8.gff C2_count1.gff C2_count2.gff 
##     1.4977666     1.3689498     1.1603190     0.3807074     0.9837674 
## C2_count3.gff C2_count4.gff C2_count5.gff C2_count6.gff C2_count7.gff 
##     1.0095498     0.1417257     1.8200930     1.1801449     1.4021843 
## C2_count8.gff C2_count9.gff 
##     1.0870040     1.3552473

This is where things get JUICY!!! We will perform principle component analysis (PCA) on the top 50 differntially expressed genes to detect any outliers within our data set. We will plot the first 2 PCs and color coding the cases (black) and controls (red).

Before performing PCA, we will transform the count data into a log2 scale which will minimize the differences between samples with low read counts as suggested by Love, Huber & Anders, 2014 Genome Biology paper.

One sample substantially differs from the others and is considered an outlier. In order to decrease variability for the analysis and ultimately/likley increase our power - we will remove this participant from the analysis.

outliers <- as.character(subset(colnames(ddsHTSeq), dataGG$PC2 < -5))
outliers
## [1] "C2_count8.gff"
ddsHTSeq <- ddsHTSeq[, !(colnames(ddsHTSeq) %in% outliers)] 

After removal of the outlier, we will perform the quality control steps we did previously. Now we have increased the total number of genes we will have in our analysis by a couple hundred

GeneCounts <- counts(ddsHTSeq)
idx.nz <- apply(GeneCounts, 1, function(x) { all(x > 0)})
sum(idx.nz)
## [1] 10623
idx <- which(apply(GeneCounts, 1, function(x) { all(x > 0)}))
RNAnames <- names(idx)
nz.counts <- subset(GeneCounts, idx.nz)
sam <- sample(dim(nz.counts)[1], 5)
nz.counts[sam, ]
##            C1_count1.gff C1_count2.gff C1_count3.gff C1_count4.gff
## uc032nrt.1            10             2             2            13
## uc002pof.3            18             4             5             7
## uc002bjf.2           149           132           171           257
## uc003awd.3             1             2             4             7
## uc062vok.1            17             7            11            26
##            C1_count5.gff C1_count6.gff C1_count7.gff C1_count8.gff
## uc032nrt.1             5             8             4             9
## uc002pof.3             9             8             6             5
## uc002bjf.2            92           139           246           154
## uc003awd.3             4             2             2             2
## uc062vok.1             7            12             6            10
##            C2_count1.gff C2_count2.gff C2_count3.gff C2_count4.gff
## uc032nrt.1             3             8            10             2
## uc002pof.3             4             6             6             1
## uc002bjf.2            54           216           143             9
## uc003awd.3             5             5             5             1
## uc062vok.1             2             8             9             1
##            C2_count5.gff C2_count6.gff C2_count7.gff C2_count9.gff
## uc032nrt.1            15             9             7            12
## uc002pof.3             8             7             7             5
## uc002bjf.2           347           196           310           215
## uc003awd.3             4             6             2             7
## uc062vok.1            19            10            10            17

To enusre there are no more outliers in our data – we will re-run the PCA analysis, this time using ALL gene variants.

rld <- rlogTransformation(ddsHTSeq, blind = T)
ntop = 82977
Pvars <- rowVars(assay(rld))
select <- order(Pvars, decreasing = TRUE)[seq_len(min(ntop,length(Pvars)))]
PCA <- prcomp(t(assay(rld)[select, ]), scale = F)
percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)

dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2], 
                    PC3 = PCA$x[,3], PC4 = PCA$x[,4], 
                    condition = colData(rld)$condition)

plot(dataGG$PC1, dataGG$PC2, col = dataGG$condition, xlab = paste0('PC1, VarExp: ', percentVar[1])
     ,ylab = paste0('PC2, VarExp: ', percentVar[2]), main = 'PCA plot', pch = 10, cex = 2)
legend(25, -20, col = c('red', 'black'), pch = 10, c('Control', 'Cocaine') )

We can see a pretty clear clustering of cases and controls. This could indicate a true effect within our data or implicate the importance of batch effects. Batch effects have profound biases on high-throuput analyses and warrant proper inspection / adjustment for our analyses. We will use a surrogate variable analysis as proposed by Jeff Leek and colleagues to account for the potential batch effect bias within our data.

dds <- ddsHTSeq
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = T)
idx <- rowMeans(dat) > 1
dat  <- dat[idx, ]
colData(dds)[1:8,1] <- 'control'
colData(dds)$condition <- factor(ifelse(is.na(colData(dds)$condition), 'control', 'cocaine'  ))
colData(dds)
## DataFrame with 16 rows and 2 columns
##               condition sizeFactor
##                <factor>  <numeric>
## C1_count1.gff   cocaine  1.8088861
## C1_count2.gff   cocaine  0.9296628
## C1_count3.gff   cocaine  0.9828487
## C1_count4.gff   cocaine  1.6339156
## C1_count5.gff   cocaine  1.1024977
## ...                 ...        ...
## C2_count4.gff   cocaine  0.1438951
## C2_count5.gff   cocaine  1.8197696
## C2_count6.gff   cocaine  1.1848570
## C2_count7.gff   cocaine  1.4115087
## C2_count9.gff   cocaine  1.3619717
dds <- ddsHTSeq
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~   1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

Below we create vectors that correspond to some co-variates we will include in our analysis.

age <- c(43,30,36,31,41,46,32,41,37,35,41,32,38,45,49,35)
pmi <- c(8,16.8,22,22,18.3,16,18,16,19,20,18.5,22,20,12,19,15)
b1 <- svseq$sv[,1]
b2 <- svseq$sv[,2]

Now we will have to make a new r object that includes all of the co-variates (age, PMI, SVA 1/2, PC 1/2 and the interaction btwn age & PMI) for our analysis

sTable <- data.frame(sampleName = sampleFiles[-16], fileName = sampleFiles[-16],
                     condition = sampleCondition[-16], b1,b2, dataGG[,1:2],age,pmi)

dds <- DESeqDataSetFromHTSeqCount(sTable, directory = directory, design = ~ condition )
design(dds) <- formula(~ b1 + b2 + PC1 + PC2 + age + pmi + age:pmi + condition)
ddsN <- estimateSizeFactors(dds)
sizeFactors(ddsN)
## C1_count1.gff C1_count2.gff C1_count3.gff C1_count4.gff C1_count5.gff 
##     1.8088861     0.9296628     0.9828487     1.6339156     1.1024977 
## C1_count6.gff C1_count7.gff C1_count8.gff C2_count1.gff C2_count2.gff 
##     1.5068281     1.3760156     1.1639772     0.3836261     0.9854404 
## C2_count3.gff C2_count4.gff C2_count5.gff C2_count6.gff C2_count7.gff 
##     1.0117164     0.1438951     1.8197696     1.1848570     1.4115087 
## C2_count9.gff 
##     1.3619717

After re-specifying our data set to accomodate for co-variates we also need to re-specify the case-control status of our participants. Then we can estimate dispersion, perform our simple case control analysis using a negative binomial wald test to identify differentially expressed genes.

colData(ddsN)[1:8,1] <- 'control'
## Warning in `[<-.factor`(`*tmp*`, i, value = c("control", "control",
## "control", : invalid factor level, NA generated
colData(ddsN)$condition <- factor(ifelse(is.na(colData(ddsN)$condition), 'control', 'cocaine'  ))
des2 <- estimateDispersions(ddsN)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
des2 <- nbinomWaldTest(des2)
## 494 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
des2Res <- results(des2, pAdjustMethod = 'bonferroni')
sum(des2Res$pvalue < .05, na.rm = T)
## [1] 197

Our initial analysis completed, but warrants inspection to make sure our test was performed properly. The first step is to plot a histogram of p-values generated by this analysis.

We can see a strong negative skew for our p-values from this negative binomial wald test. Likely our analysis is using the incorrect null variance of p-values. Ideally, we would like to see a uniform distribution of p-values with a peak towards the lower p-values (if there is indeed a true effect). We will perform empirical null modeling to callibrate our test properlly and ensure that our analysis is working properly. The fdrtool package makes this process fairly easy.

## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting

Now we have a proper distribution of p-values for our test. To adjust for multiple testing we will accomodate our p-value threshold by the number of tests we are performing (82,977 splice variants). To reduce the number of false positives, I incorporated a conservative correction technique (bonferroni) that divides our p-value threshold by the number of tests performed (.05 / 82,977 = 6.025766e-07). Thus, significantly differentially expressed genes will require a p-value less than this threshold.

table(des2Res[,'padj'] <  6.025766e-07)
## 
## FALSE  TRUE 
## 82788   189

We have 189 significantly differentially expressed genes!! Now, to plot our results with a MA plot. All red dots indicate [significant] differential expression between cocaine and non-drug conditions