RNA Seq Pipeline with DESeq2

The following are necessary packages for this script and must be installed with all accompanying packages:

suppressWarnings(suppressMessages({
  library('easypackages',warn.conflicts = FALSE, quietly=TRUE)
  library("readr",warn.conflicts = FALSE, quietly=TRUE)
  library("ggplot2",warn.conflicts = FALSE, quietly=TRUE)
  library("RColorBrewer",warn.conflicts = FALSE, quietly=TRUE)
  library("biomaRt",warn.conflicts = FALSE, quietly=TRUE)
  library("DESeq2",warn.conflicts = FALSE, quietly=TRUE)
  library("pheatmap",warn.conflicts = FALSE, quietly=TRUE)
  library("RColorBrewer",warn.conflicts = FALSE, quietly=TRUE)
  library("viridis",warn.conflicts = FALSE, quietly=TRUE)
  library("ggpubr",warn.conflicts = FALSE, quietly=TRUE)
  library('EnhancedVolcano',warn.conflicts = FALSE, quietly=TRUE)}))

If any of these fail to load properly, check current R version and make sure all are correctly installed. I am currently using R 3.6.3 to ensure all packages are compatible

GOTerm Pipeline

This script can be used to manually select Gene Ontology terms for project filtering RNASeq raw data input goes here depending on your current working directory or project

source <- "1 MWK/Research/Shah Lab/RNAseq/PS11/DENV_PS11[7945].csv"

The following code recommends data from STAR alignment, including a column of ENSEMBL ID’s and gene names, otherwise script will need to be adjusted

raw<-read.csv(source, header=TRUE, row.names = 'gene_ref') #'making ENSMBL ID's rownames

Gene names can be removed from this dataset to allow for DESeq2 analysis data frame:

raw$Name <- NULL

Removing any data with ALL 0 reads, which can be made more stringent if desired:

raw <- raw [which(rowSums(raw) > 0), ]
head(raw)

##                  DV1  DV2  DV3 Mock1 Mock2 Mock3
## ENSG00000000003 1800 1526 1583  2588  1580  2281
## ENSG00000000005    2    0    0     1     2     5
## ENSG00000000419 2409 1993 2250  2110  1407  1926
## ENSG00000000457  449  367  370   358   251   399
## ENSG00000000460  479  418  398   637   393   490
## ENSG00000000971 1182 1166 1141  1680  1182  1539

Importing data again for use in attributing gene names to ENSEMBL ID’s for various figures and isolating ENSEMBL ID and Gene Names ONLY of all genes in the data set:

id <- read.csv(source, header=TRUE)
id <- id[,c(1,2)]
head(id)

##          gene_ref     Name
## 1 ENSG00000000003   TSPAN6
## 2 ENSG00000000005     TNMD
## 3 ENSG00000000419     DPM1
## 4 ENSG00000000457    SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938      FGR

Biomart offers various sets of gene names and associations. Choose which is compatible with your data.

In this case, the IDs are ENSEMBL and the genes of interest are human.
If you don’t want to filter by GO terms, then simply exclude filter and value arguments, or bypass this portion of code completely.
The script matches GO terms to ENSEMBL ID’s , filtering out all noninterested genes

bm <- useMart("ensembl") 
bm <- useDataset("hsapiens_gene_ensembl", mart=bm) 
go <- getBM(mart=bm, attributes=c('ensembl_gene_id','go_id'),filter ="go_parent_term",
            values = c("GO:0002376"))

## Cache found

df <- data.frame(go, raw[match(go$ensembl_gene_id, row.names(raw)),])

Now, the data frame is cleaned up for analysis

Remove name column
Create a data matrix for DESeq2 interpretation
Remove NA values and duplicated GO terms

df <- df[,!names(df) %in% c("ensembl_gene_id", "go_id")] 
df <- as.matrix(df)
df <- na.omit(df) 
df <- df[!duplicated(df), ]

STOP HERE and REVIEW for ERRORS

head(df)

##                 DV1 DV2 DV3 Mock1 Mock2 Mock3
## ENSG00000109107 256 234 240   356   234   287
## ENSG00000134545   5   8   4     1     4     3
## ENSG00000108984 938 842 847  1100   802   994
## ENSG00000114021 785 599 611  1031   676   952
## ENSG00000143772   0   0   0     0     0     1
## ENSG00000244731   0   1   1     2     2     0

DESeq2 Pipeline Analysis

First, differentiating between tests (exp) and mock (ctl) trials, which is dependent on the number of different conditions and the number of replicates per condition

condition <- factor(c(rep("DV2", 3), rep("Mock", 3)))
head(condition)

## [1] DV2  DV2  DV2  Mock Mock Mock
## Levels: DV2 Mock

Now, establish a correlation between experiment and mock trials for DESeq2 to use for analysis With this info, a DESeq Data Set can be created from our matrix and condition objects:

coldata <- data.frame(row.names=colnames(df), condition)
dds <- DESeqDataSetFromMatrix(countData=df, colData=coldata, design=~condition)
dds

## class: DESeqDataSet 
## dim: 2313 6 
## metadata(1): version
## assays(1): counts
## rownames(2313): ENSG00000109107 ENSG00000134545 ... ENSG00000029993
##   ENSG00000197879
## rowData names(0):
## colnames(6): DV1 DV2 ... Mock2 Mock3
## colData names(1): condition

IMPORTANT - running pipeline

dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

The following will collapse replicates into single data entities, but you can skip this if you want to visualize replicates independently

dds <- collapseReplicates(dds, dds$condition)

Plot dispersions to qualitatively check data is appropriate for further analysis:

plotDispEsts(dds, main="Dispersion plot")

Differential expression results are produced to graphically view the data. The results are then organized by most significant genes and merged with normalized data:

res <- results(dds)
table(res$padj<0.05)

## 
## FALSE  TRUE 
##  1155  1023

res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)

##              Gene  baseMean log2FoldChange      lfcSE      stat        pvalue
## 1 ENSG00000102580 3925.7724      -1.823522 0.05607761 -32.51782 5.970764e-232
## 2 ENSG00000120129 1120.7174      -2.712696 0.09572161 -28.33943 1.129688e-176
## 3 ENSG00000168003 4785.4536      -1.544300 0.05468533 -28.23974 1.902337e-175
## 4 ENSG00000116584 4236.4371      -1.902055 0.06878449 -27.65237 2.613290e-168
## 5 ENSG00000168209  698.4339      -3.302805 0.12143665 -27.19776 6.903021e-163
## 6 ENSG00000172216 1850.8910      -2.017493 0.07427333 -27.16308 1.774055e-162
##            padj       DV2      Mock
## 1 1.300432e-228 17363.734 4385.3389
## 2 1.230231e-173  5518.038  759.2210
## 3 1.381096e-172 20217.496 6179.1926
## 4 1.422937e-165 18933.207 4511.1947
## 5 3.006956e-160  3595.434  323.6292
## 6 6.439818e-160  8420.607 1866.5885

The results can be exported for other use outside of R:

write.csv(resdata, file="diffexpr-results.csv")

Significant genes can be exported as a single list for over representation analysis on various GO servers, such as the PANTHER database

sig <- resdata[resdata$padj<0.05,]
sig <- na.omit(sig)
write.csv2(sig$Gene, file="GO-enrich.txt", quote=FALSE, row.names=FALSE)

Examine plot of p-values, this will again show you if something is very wrong with your data or if there is little change from the control

STOP HERE and REVIEW for ERRORS

hist(res$pvalue, breaks=50, col="grey", main = "P-value Distribution")

As an example, the differing counts of genes can be viewed, here it is the most significant gene:

topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, topGene, "condition")

Volcano Plot

A volcano plot will now be used to analyze the distribution of genes and significant genes.

The experimental conditions must first be compared to mock conditions for fold change.
Familiar gene IDs can be matched using the raw data to ENSEMBL IDs.

res1 <- results(dds,
                contrast = c('condition','DV2','Mock'))
res1 <- lfcShrink(dds,
                  contrast = c('condition','DV2','Mock'), res=res1)

## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895

## Warning in qf(0.99, p, m - p): NaNs produced

res1_names <- data.frame(res1, id[match(row.names(res1), id$gene_ref),])

There are a number of parameters that can altered with the Volcano Plot in this particular package. The most important values to select are cutoffs and whether or not labeling is necessary.

EnhancedVolcano(res1,
                lab = res1_names$Name,
                x = 'log2FoldChange',
                y = 'pvalue',
                pCutoff = 0.01,
                FCcutoff = 2,
                xlim = c(-2, 4),
                col=c('#f6d746', 'black', '#e55c30', '#84206b'), labSize=3.0, colAlpha=.4, 
                title="Mock vs. Infected Immune Response", subtitle = NULL,
                gridlines.major = FALSE, gridlines.minor = FALSE, boxedLabels = TRUE, 
                drawConnectors = TRUE, widthConnectors = 0.25, colConnectors = 'black',
                endsConnectors= 'last')

MA plot

There isn’t much difference with this plot than volcano plots, but it can be a better visualization at times. Note the similar parameters:

maplot = ggmaplot(res1, fdr = 0.01, fc = 1, size = 1,
                  palette = c("#e55c30", "#84206b", "#f6d746"),
                  genenames = as.vector(res1_names$Name),
                  legend="top", top = 15,font.label = c("bold", 11),
                  label.rectangle = TRUE, font.legend = c("bold",12), font.main = "bold",
                  xlab = "Log2 Mean Expression",  ylab="Log2 FC")
show(maplot)

Heatmap

The data is normalized so that genes can be compared to one another regardless of lower relative gene counts

res <- results(dds)
vsd <- varianceStabilizingTransformation(dds)

Define an assay of most significantly divergent genes. The number here determines how many genes the heatmap will display. Then, familiar gene IDs are attached similar to the volcano/MA plots.

mat <- assay(vsd)[ head(order(res$padj),25), ] 
mat <- mat - rowMeans(mat) 
mat <- data.frame(mat, id[match(row.names(mat), id$gene_ref),])
rownames(mat) <- mat$Name
mat <- mat[,!names(mat) %in% c("gene_ref", "Name")]

The order of columns on the heatmap can be manually changed:

mat<- mat[,c(2,1)]

The controls and tests can be differentiated, but be careful to match the order of treatments. The annotations can also be given a color scheme.

my_sample_col <- data.frame(Treatment = rep(c("Ctl", "Inf"), c(1,1))) 
row.names(my_sample_col) <- colnames(mat)
ann_colors = list(
  Treatment = c('Inf' = "black", 'Ctl' = "dark grey"))

Here is one clustering function, adapted from a github layout, that offers various methods depending on visual preference

drows = dist(mat, method = "canberra")
dcols = dist(t(mat), method = "canberra")
callback = function(hc, mat){
  sv = svd(t(mat))$v[,1]
  dend = reorder(as.dendrogram(hc), wts = sv)
  as.hclust(dend)
}

The heat map graphic can now be created, adjust settings by visual inspection:

pheatmap(mat, 
         annotation_col = my_sample_col,
         cluster_cols = FALSE, #'columns are in order of data input
         clustering_distance_rows = drows, clustering_distance_cols = dcols,
         treeheight_row = 0, treeheight_col = 0, #'hide dendrograms
         clustering_callback = callback, cutree_rows = 1, #'separating rows
         color =  inferno(50), #'gradient coloration
         border_color = NA,
         annotation_colors = ann_colors,
         legend_breaks =c(-1.5,0,1.5), legend_labels=c(-1.5,0,1.5))

RNA Seq Pipeline with DESeq2

Matt Kenaston

April 3rd, 2020