The following are necessary packages for this script and must be installed with all accompanying packages:
suppressWarnings(suppressMessages({
library('easypackages',warn.conflicts = FALSE, quietly=TRUE)
library("readr",warn.conflicts = FALSE, quietly=TRUE)
library("ggplot2",warn.conflicts = FALSE, quietly=TRUE)
library("RColorBrewer",warn.conflicts = FALSE, quietly=TRUE)
library("biomaRt",warn.conflicts = FALSE, quietly=TRUE)
library("DESeq2",warn.conflicts = FALSE, quietly=TRUE)
library("pheatmap",warn.conflicts = FALSE, quietly=TRUE)
library("RColorBrewer",warn.conflicts = FALSE, quietly=TRUE)
library("viridis",warn.conflicts = FALSE, quietly=TRUE)
library("ggpubr",warn.conflicts = FALSE, quietly=TRUE)
library('EnhancedVolcano',warn.conflicts = FALSE, quietly=TRUE)}))
If any of these fail to load properly, check current R version and make sure all are correctly installed. I am currently using R 3.6.3 to ensure all packages are compatible
GOTerm Pipeline
This script can be used to manually select Gene Ontology terms for project filtering RNASeq raw data input goes here depending on your current working directory or project
source <- "1 MWK/Research/Shah Lab/RNAseq/PS11/DENV_PS11[7945].csv"
The following code recommends data from STAR alignment, including a column of ENSEMBL ID’s and gene names, otherwise script will need to be adjusted
raw<-read.csv(source, header=TRUE, row.names = 'gene_ref') #'making ENSMBL ID's rownames
Gene names can be removed from this dataset to allow for DESeq2 analysis data frame:
raw$Name <- NULL
Removing any data with ALL 0 reads, which can be made more stringent if desired:
raw <- raw [which(rowSums(raw) > 0), ]
head(raw)
## DV1 DV2 DV3 Mock1 Mock2 Mock3
## ENSG00000000003 1800 1526 1583 2588 1580 2281
## ENSG00000000005 2 0 0 1 2 5
## ENSG00000000419 2409 1993 2250 2110 1407 1926
## ENSG00000000457 449 367 370 358 251 399
## ENSG00000000460 479 418 398 637 393 490
## ENSG00000000971 1182 1166 1141 1680 1182 1539
Importing data again for use in attributing gene names to ENSEMBL ID’s for various figures and isolating ENSEMBL ID and Gene Names ONLY of all genes in the data set:
id <- read.csv(source, header=TRUE)
id <- id[,c(1,2)]
head(id)
## gene_ref Name
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000005 TNMD
## 3 ENSG00000000419 DPM1
## 4 ENSG00000000457 SCYL3
## 5 ENSG00000000460 C1orf112
## 6 ENSG00000000938 FGR
Biomart offers various sets of gene names and associations. Choose which is compatible with your data.
In this case, the IDs are ENSEMBL and the genes of interest are human.
If you don’t want to filter by GO terms, then simply exclude filter and value arguments, or bypass this portion of code completely.
The script matches GO terms to ENSEMBL ID’s , filtering out all noninterested genes
bm <- useMart("ensembl")
bm <- useDataset("hsapiens_gene_ensembl", mart=bm)
go <- getBM(mart=bm, attributes=c('ensembl_gene_id','go_id'),filter ="go_parent_term",
values = c("GO:0002376"))
## Cache found
df <- data.frame(go, raw[match(go$ensembl_gene_id, row.names(raw)),])
Now, the data frame is cleaned up for analysis
Remove name column
Create a data matrix for DESeq2 interpretation
Remove NA values and duplicated GO terms
df <- df[,!names(df) %in% c("ensembl_gene_id", "go_id")]
df <- as.matrix(df)
df <- na.omit(df)
df <- df[!duplicated(df), ]
STOP HERE and REVIEW for ERRORS
head(df)
## DV1 DV2 DV3 Mock1 Mock2 Mock3
## ENSG00000109107 256 234 240 356 234 287
## ENSG00000134545 5 8 4 1 4 3
## ENSG00000108984 938 842 847 1100 802 994
## ENSG00000114021 785 599 611 1031 676 952
## ENSG00000143772 0 0 0 0 0 1
## ENSG00000244731 0 1 1 2 2 0
DESeq2 Pipeline Analysis
First, differentiating between tests (exp) and mock (ctl) trials, which is dependent on the number of different conditions and the number of replicates per condition
condition <- factor(c(rep("DV2", 3), rep("Mock", 3)))
head(condition)
## [1] DV2 DV2 DV2 Mock Mock Mock
## Levels: DV2 Mock
Now, establish a correlation between experiment and mock trials for DESeq2 to use for analysis With this info, a DESeq Data Set can be created from our matrix and condition objects:
coldata <- data.frame(row.names=colnames(df), condition)
dds <- DESeqDataSetFromMatrix(countData=df, colData=coldata, design=~condition)
dds
## class: DESeqDataSet
## dim: 2313 6
## metadata(1): version
## assays(1): counts
## rownames(2313): ENSG00000109107 ENSG00000134545 ... ENSG00000029993
## ENSG00000197879
## rowData names(0):
## colnames(6): DV1 DV2 ... Mock2 Mock3
## colData names(1): condition
IMPORTANT - running pipeline
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
The following will collapse replicates into single data entities, but you can skip this if you want to visualize replicates independently
dds <- collapseReplicates(dds, dds$condition)
Plot dispersions to qualitatively check data is appropriate for further analysis:
plotDispEsts(dds, main="Dispersion plot")
Differential expression results are produced to graphically view the data. The results are then organized by most significant genes and merged with normalized data:
res <- results(dds)
table(res$padj<0.05)
##
## FALSE TRUE
## 1155 1023
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
## Gene baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000102580 3925.7724 -1.823522 0.05607761 -32.51782 5.970764e-232
## 2 ENSG00000120129 1120.7174 -2.712696 0.09572161 -28.33943 1.129688e-176
## 3 ENSG00000168003 4785.4536 -1.544300 0.05468533 -28.23974 1.902337e-175
## 4 ENSG00000116584 4236.4371 -1.902055 0.06878449 -27.65237 2.613290e-168
## 5 ENSG00000168209 698.4339 -3.302805 0.12143665 -27.19776 6.903021e-163
## 6 ENSG00000172216 1850.8910 -2.017493 0.07427333 -27.16308 1.774055e-162
## padj DV2 Mock
## 1 1.300432e-228 17363.734 4385.3389
## 2 1.230231e-173 5518.038 759.2210
## 3 1.381096e-172 20217.496 6179.1926
## 4 1.422937e-165 18933.207 4511.1947
## 5 3.006956e-160 3595.434 323.6292
## 6 6.439818e-160 8420.607 1866.5885
The results can be exported for other use outside of R:
write.csv(resdata, file="diffexpr-results.csv")
Significant genes can be exported as a single list for over representation analysis on various GO servers, such as the PANTHER database
sig <- resdata[resdata$padj<0.05,]
sig <- na.omit(sig)
write.csv2(sig$Gene, file="GO-enrich.txt", quote=FALSE, row.names=FALSE)
Examine plot of p-values, this will again show you if something is very wrong with your data or if there is little change from the control
STOP HERE and REVIEW for ERRORS
hist(res$pvalue, breaks=50, col="grey", main = "P-value Distribution")
As an example, the differing counts of genes can be viewed, here it is the most significant gene:
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, topGene, "condition")
Volcano Plot
A volcano plot will now be used to analyze the distribution of genes and significant genes.
The experimental conditions must first be compared to mock conditions for fold change.
Familiar gene IDs can be matched using the raw data to ENSEMBL IDs.
res1 <- results(dds,
contrast = c('condition','DV2','Mock'))
res1 <- lfcShrink(dds,
contrast = c('condition','DV2','Mock'), res=res1)
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
## Warning in qf(0.99, p, m - p): NaNs produced
res1_names <- data.frame(res1, id[match(row.names(res1), id$gene_ref),])
There are a number of parameters that can altered with the Volcano Plot in this particular package. The most important values to select are cutoffs and whether or not labeling is necessary.
EnhancedVolcano(res1,
lab = res1_names$Name,
x = 'log2FoldChange',
y = 'pvalue',
pCutoff = 0.01,
FCcutoff = 2,
xlim = c(-2, 4),
col=c('#f6d746', 'black', '#e55c30', '#84206b'), labSize=3.0, colAlpha=.4,
title="Mock vs. Infected Immune Response", subtitle = NULL,
gridlines.major = FALSE, gridlines.minor = FALSE, boxedLabels = TRUE,
drawConnectors = TRUE, widthConnectors = 0.25, colConnectors = 'black',
endsConnectors= 'last')
MA plot
There isn’t much difference with this plot than volcano plots, but it can be a better visualization at times. Note the similar parameters:
maplot = ggmaplot(res1, fdr = 0.01, fc = 1, size = 1,
palette = c("#e55c30", "#84206b", "#f6d746"),
genenames = as.vector(res1_names$Name),
legend="top", top = 15,font.label = c("bold", 11),
label.rectangle = TRUE, font.legend = c("bold",12), font.main = "bold",
xlab = "Log2 Mean Expression", ylab="Log2 FC")
show(maplot)
Heatmap
The data is normalized so that genes can be compared to one another regardless of lower relative gene counts
res <- results(dds)
vsd <- varianceStabilizingTransformation(dds)
Define an assay of most significantly divergent genes. The number here determines how many genes the heatmap will display. Then, familiar gene IDs are attached similar to the volcano/MA plots.
mat <- assay(vsd)[ head(order(res$padj),25), ]
mat <- mat - rowMeans(mat)
mat <- data.frame(mat, id[match(row.names(mat), id$gene_ref),])
rownames(mat) <- mat$Name
mat <- mat[,!names(mat) %in% c("gene_ref", "Name")]
The order of columns on the heatmap can be manually changed:
mat<- mat[,c(2,1)]
The controls and tests can be differentiated, but be careful to match the order of treatments. The annotations can also be given a color scheme.
my_sample_col <- data.frame(Treatment = rep(c("Ctl", "Inf"), c(1,1)))
row.names(my_sample_col) <- colnames(mat)
ann_colors = list(
Treatment = c('Inf' = "black", 'Ctl' = "dark grey"))
Here is one clustering function, adapted from a github layout, that offers various methods depending on visual preference
drows = dist(mat, method = "canberra")
dcols = dist(t(mat), method = "canberra")
callback = function(hc, mat){
sv = svd(t(mat))$v[,1]
dend = reorder(as.dendrogram(hc), wts = sv)
as.hclust(dend)
}
The heat map graphic can now be created, adjust settings by visual inspection:
pheatmap(mat,
annotation_col = my_sample_col,
cluster_cols = FALSE, #'columns are in order of data input
clustering_distance_rows = drows, clustering_distance_cols = dcols,
treeheight_row = 0, treeheight_col = 0, #'hide dendrograms
clustering_callback = callback, cutree_rows = 1, #'separating rows
color = inferno(50), #'gradient coloration
border_color = NA,
annotation_colors = ann_colors,
legend_breaks =c(-1.5,0,1.5), legend_labels=c(-1.5,0,1.5))