Differential expression testing of sets of genes studied in a sequencing experiment is a core component of any single-cell RNAseq analysis workflow. This type of analysis refers to taking the normalized read count data from a sequencing experiment and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, D.E. testing can be used to discern the differences in gene expression between mutant and wildtype cells, in order to study disease-specific gene expression patterns.
This notebook demonstrates how to implement D.E. analysis in two separate modules, both of which achieve the same statistical objective. We also show how to implement several useful visualizations for exploring expression patterns in the results of D.E..
Part I: Manual analysis implementing D.E. with
base R functions and statistical testing. Uses an example
dataset.
Part II: How to perform an equivalent gene
differential expression testing analysis as shown in the previous
base-R example workflow, except using Seurat
differential expression testing functionalities. We implement this
section using the celltype-specific subsets of the complete NSCI
dataset.
Part III: Supplementary visualization module for analyzing results of D.E. testing.
Gene expression analysis of histone deacetylase 1 (HDAC1) knockout mouse
Affymetrix microarray
Dataset: GSE5583 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5583)
Paper: Mol Cell Biol 2006 Nov;26(21):7913-28.
PMID: 16940178 (http://www.ncbi.nlm.nih.gov/pubmed/16940178)
# Read the data into R
dset<-read.table("~/AlzheimersDS/dev-notebooks/GSE5583-data.txt", header = TRUE, sep = "", dec = ".")
data<- as.matrix(dset[,2:7])
geneIDs <- dset[,1]
# Check the loaded dataset
# Dimension of the dataset
dim(data)
## [1] 12488 6
# First few rows
head(data)
## WT.GSM130365 WT.GSM130366 WT.GSM130367 KO.GSM130368 KO.GSM130369
## [1,] 11.5 5.6 69.1 15.7 36.0
## [2,] 20.5 32.4 93.3 31.8 14.4
## [3,] 72.4 89.0 79.2 80.5 130.1
## [4,] 261.0 226.2 365.1 432.0 447.3
## [5,] 1086.2 1555.6 1487.1 1062.2 1365.9
## [6,] 49.7 52.9 15.0 25.8 48.8
## KO.GSM130370
## [1,] 42.0
## [2,] 22.9
## [3,] 86.7
## [4,] 288.1
## [5,] 1436.2
## [6,] 54.8
# Check the behavior of the data
# First reformat data into long format for plotting
plotdata <- as.data.frame(data) %>%
tidyr::gather(key = "cell_line", value = "expression_val")
# Plot histogram
hist1 <- ggplot(plotdata, aes(x=expression_val)) +
geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Expression Level Distribution - Raw Data") +
labs(x= "Expression level", y = "Count") +
xlim(0,3000)+
theme_minimal()
hist1
# Check the behavior of the data after log-transformation
data2 = log2(data)
# First reformat data into long format for plotting
plotdata <- as.data.frame(data2) %>%
tidyr::gather(key = "cell_line", value = "expression_val")
# Plot histogram
hist2 <- ggplot(plotdata, aes(x=expression_val)) +
geom_histogram( binwidth=1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Expression Level Distribution - Normalized Data") +
labs(x= "Log-normalized expression level", y = "Count") +
theme_minimal()
hist2
Hierarchical clustering of the “samples” based on the correlation coefficients of the expression values
hc = hclust(as.dist(1-cor(data2)))
plot(hc, main="GSE5583 - Hierarchical Clustering")
DEres.df <- as.data.frame(data2)
DEres.df <- cbind.data.frame(geneIDs, DEres.df)
# Separate the two conditions into two smaller data frames
wt = data2[,1:3]
ko = data2[,4:6]
# Compute the means of the samples of each condition
wt.mean = apply(wt, 1, mean)
ko.mean = apply(ko, 1, mean)
# Just get the maximum of all the means for plotting
limit = max(wt.mean, ko.mean)
means <- cbind.data.frame(wt.mean,ko.mean)
# Scatter plot
genescatter <- ggplot(means, aes(x=wt.mean,y=ko.mean)) +
geom_point(alpha= 0.5, color = "azure4") +
# draw linear trend line
geom_smooth(method=lm , color="red", se=FALSE) +
xlab("WT") +
ylab("KO") +
ggtitle("GSE5583 - Scatter") +
xlim(0, limit) + ylim(0, limit)
genescatter
## `geom_smooth()` using formula 'y ~ x'
#compute fold
fold = wt.mean - ko.mean
#add to results dataframe
DEres.df$logFC <- fold
# Generate histogram of the fold differences
foldhist <- ggplot(DEres.df, aes(x=logFC)) +
geom_histogram( binwidth=0.5, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Fold Change Distribution") +
labs(x= "Log fold change in expression level between groups", y = "Count") +
theme_minimal()
foldhist
pvalue = NULL # Empty list for the p-values
tstat = NULL # Empty list of the t test statistics
for(i in 1 : nrow(data)) { # For each gene :
x = wt[i,] # WT of gene number i
y = ko[i,] # KO of gene number i
# Compute t-test between the two conditions
t = t.test(x, y)
# Put the current p-value in the pvalues list
pvalue[i] = t$p.value
# Put the current t-statistic in the tstats list
tstat[i] = t$statistic
}
# add results to results dataframe
DEres.df$pvalue <- pvalue
# Generate -log10(pvalues)
pvallogged <- -log10(pvalue)
# add results to results dataframe
DEres.df$logpval <- pvallogged
# Generate histogram of -log10(pvalues)
pval_hist <- ggplot(DEres.df, aes(x=logpval)) +
geom_histogram( binwidth=0.25, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("-log10(pvalue) Distribution") +
labs(x= "Significance levels occuring in D.E. results", y = "Count") +
xlim(0,4) +
theme_minimal()
pval_hist
volcanoPlot: Function to generate the volcano plot
highlighting significantly differentially-expressed genes using
ggplot2This function takes as input a results dataframe
containing columns of the gene IDs, fold change values and the
corresponding p-values for gene expression significance from a previous
DE analysis. It also expects two user-specified thresholds
pvalue_cutoff and fold_cutoff to render
significance cutoffs on the volcano plot visualization. These latter
values default to 0.01 and 0.5, respectively.
You can optionally also specify a plot title.
This function outputs a ggplot2 object that can then be
further refined with additional ggplot2 aesthetics.
# Volcano Plots:
# put the biological significance (fold changes)
# and statistical significance (p-value) in one plot
# Generate the volcano plot
volcanoPlot <- function(results.df, fold_cutoff=0.5,
pvalue_cutoff=0.01, title =
"Volcano Plot of Gene Expression"){
# Inputs: ##############################################
# results.df: dataframe containing columns:
# 1) "geneIDs": gene IDs
# 2) "logFC": foldchange values from DE analysis
# 3) "logpval": log-transformed p-values from DE analysis
# fold cutoff: significance threshold to render visually on plot;
# denotes fold difference between mutant and wildtype
# also referred to as "biologial signal"
# p_value_cutoff: significance threshold to render visually on plot;
# denotes statistical significance
########################################################
# create factor denoting differential expression status
stats.df <- results.df %>% mutate(diffexpressed =
ifelse(logFC < -(fold_cutoff) &
logpval >= -log10(pvalue_cutoff),
"Significant, downregulated",
ifelse(logFC > fold_cutoff &
logpval >= -log10(pvalue_cutoff),
"Significant, upregulated", "Normally expressed")))
stats.df$diffexpressed <- as.factor(stats.df$diffexpressed)
# Base plot
plot <- ggplot(stats.df, aes(x = logFC, y = logpval, col = diffexpressed, text = geneIDs)) +
geom_point() +
ggtitle(title) +
ylab("-log10(pvalue)") +
geom_vline(xintercept=c(-fold_cutoff, fold_cutoff), col="red") +
geom_hline(yintercept=-log10(pvalue_cutoff), col="darkturquoise") +
scale_color_manual(values=c("black", "blue", "red"),
name = "Differential expression\nstatus") +
theme_minimal()
return(plot)
}
# Volcano: put the biological significance (fold-change)
# and statistical significance (p-value) in one plot
plot <- volcanoPlot(DEres.df, fold_cutoff=0.5, pvalue_cutoff=0.01)
ggplotly(plot, tooltip="text")
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Once the differentially-expressed genes between your conditions of interest have been identified, you may want to save them to a convenient location so they can be easily read in for subsequent analyses.
# Filter genes using dplyr
# Define thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
#filtering pipeline
filtered <- DEres.df %>% filter( abs(logFC) >= fold_cutoff &
pvalue <= pvalue_cutoff )
Show original versus filtered dimensions:
dim(DEres.df)
## [1] 12488 10
dim(filtered)
## [1] 318 10
Now we can save the dataframe of significantly differentially-expressed genes to an R object (.rds) for easy re-loading. Optionally also save the significantly differentially-expressed gene symbols to a text file.
# save to rds
#saveRDS(filtered, "GSE5583_DE.rds")
# save to text
#write.table(filtered, "GSE5583_DE.txt", sep = "\t", quote = FALSE)
We use our filtered DE results from our example GSE5883 dataset to
generate heatmaps by two different R methods.
head(DEres.df)
## geneIDs WT.GSM130365 WT.GSM130366 WT.GSM130367 KO.GSM130368 KO.GSM130369
## 1 100001_at 3.523562 2.485427 6.110614 3.972693 5.169925
## 2 100002_at 4.357552 5.017922 6.543805 4.990955 3.847997
## 3 100003_at 6.177918 6.475733 6.307429 6.330917 7.023477
## 4 100004_at 8.027906 7.821455 8.512148 8.754888 8.805099
## 5 100005_at 10.085074 10.603255 10.538286 10.052840 10.415636
## 6 100006_at 5.635174 5.725196 3.906891 4.689299 5.608809
## KO.GSM130370 logFC pvalue logpval
## 1 5.392317 -0.80511083 0.5449730 0.2636250
## 2 4.517276 0.85435054 0.3253745 0.4876165
## 3 6.437960 -0.27709146 0.3287830 0.4830906
## 4 8.170426 -0.45630111 0.1892376 0.7229925
## 5 10.488041 0.09003287 0.6928410 0.1593664
## 6 5.776104 -0.26898401 0.7180077 0.1438709
Prep data
#select only numeric columns for generating heatmap
filtered <- filtered %>% dplyr::select(2:7)
# coerce df to matrix
filtered <- as.matrix(filtered)
Cluster the rows (genes) & columns (samples) by correlation
rowv = as.dendrogram(hclust(as.dist(1-cor(t(filtered)))))
colv = as.dendrogram(hclust(as.dist(1-cor(filtered))))
Generate a heatmap
# base heatmap
heatmap(filtered, Rowv=rowv, Colv=colv, cexCol=0.7)
# Enhanced heatmap
heatmap.2(filtered, Rowv=rowv, Colv=colv, cexCol=0.7,
col = rev(redblue(256)), scale = "row")
SeuratThis section demonstrates how to perform an equivalent gene
differential expression testing analysis as shown in the previous
base-R example workflow, except using Seurat
built-in differential expression testing functionalities. In
Seurat, differential expression testing between two
populations in a dataset is implemented with the
FindMarkers() function. As a default, Seurat
performs differential expression based on the non-parametric
Wilcoxon rank sum test.
Seurat
objectReplace the code Ast (for “astrocytes”) below with the code for your assigned cell type. Optionally, also replace the experimental time point at which you want to analyze differential expression.
# specify your cell type
my_celltype <- "Ast"
# specify time point of interest
my_timept <- "2mo"
Now, load in the data from your celltype-specific Seurat
object, which should be saved in a file in your
student-notebooks directory with a filename in the format
"FinalMergedData-my_celltype.rds"(where
my-celltype is your celltype code). Change the filepath in
the code below to match the location of the file in your workspace,
wherever you have it saved. For a reminder of how this file should have
been generated, see your copy of
Notebook4-Data-Access-and-Extraction.Rmd, description under
the section “Day 3 Take Home Assignment”*.
# Load Seurat data
# Change the filepath in quotes "" below to match where you
# have your celltype data saved
# The expected filepath is "~/AlzheimersDS/student-notebooks/FinalMergedData-my_celltype.rds"
# (where `my-celltype` is your celltype code)
my_celltype_data <- readRDS("/data/Alzheimers_DS/FinalMergedData-Ast.rds")
Results derived from D.E. will be most meaningful if we separate our celltype data by the experimental timepoint at which it was collected. The code below will make a subset of your celltype-specific Seurat object by time, creating an object containing only the expression data for cells of your cell type, at the specified timepoint.
# Subset my celltype-specific Seurat object by time
my_seurat <- subset(x = my_celltype_data, subset = Age == my_timept)
nrow(my_seurat)
## [1] 40096
Find differentially-expressed genes between the conditions
"V337M" (mutant) versus "V337V" (wildtype) for
this celltype. Perform the DE analysis using the default recommended
test (Wilcoxon rank sum).
Recall that by default, this Seurat function will draw
on the normalized expression values in the Seurat object
slot data. See Notebook 2 for further explanation of
the different data slots in Seurat objects.
# set active identities of my subset seurat to "Mt" attribute (denoting variant)
# for DE testing
Idents(my_seurat) <- "Mt"
summary(as.factor(my_seurat$Mt))
## V337M V337V
## 81 98
# Run D.E. test to identify genes differentially expressed between
# between diseased and normal cells, for my celltype and at this timepoint
marker_genes.df <- FindMarkers(my_seurat,
ident.1 = "V337M", ident.2 = "V337V")
#save results
saveRDS(marker_genes.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept))
# order results by p-value
marker_genes.df <- marker_genes.df %>% arrange(p_val)
head(marker_genes.df)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## HMGXB4 2.964098e-05 -0.4700985 0.247 0.541 1
## SNRPGP10 6.754363e-05 0.3308939 0.222 0.031 1
## TTBK2 1.044893e-04 -0.3491722 0.049 0.265 1
## RAB28 1.456217e-04 0.3951541 0.383 0.143 1
## HIRIP3 3.108470e-04 -0.2753372 0.037 0.224 1
## DOK5 3.644166e-04 0.4949083 0.617 0.357 1
Plot results distributions
# Plot pvalues
marker_genes_hist <- ggplot(marker_genes.df, aes(x=p_val)) +
geom_histogram( binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle(sprintf("P-Value Distribution for D.E. Genes in %s at Timepoint %s",my_celltype,my_timept)) +
labs(x= "P Value", y = "Count") +
theme_minimal()
marker_genes_hist
# Plot log fold changes in expression from mt to normal
marker_genes_hist2 <- ggplot(marker_genes.df, aes(x=avg_log2FC)) +
geom_histogram( binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle(sprintf("LogFC Distribution for D.E. Genes in %s at Timepoint %s",my_celltype,my_timept)) +
labs(x= "logFC", y = "Count") +
theme_minimal()
marker_genes_hist2
In this context, positive log fold change values indicate that the gene is more highly expressed in mutant cells relative to normal, and vice versa for negative values.
Set significance thresholds and generate volcano plot
# set thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
# format a new dataframe using the contents of
# marker_genes.df, to pass to volcanoPlot()"
plot.df <- marker_genes.df %>%
tibble::rownames_to_column("geneIDs") %>%
dplyr::select(geneIDs, avg_log2FC, p_val) %>%
mutate(logpval = -log10(p_val))
colnames(plot.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot <- volcanoPlot(plot.df, fold_cutoff, pvalue_cutoff,
title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept))
ggplotly(plot, tooltip="text")
# Filter output of FindMarkers() genes using dplyr
# Define thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
#filtering pipeline
filtered <- DEres.df %>% filter( abs(logFC) >= fold_cutoff &
pvalue <= pvalue_cutoff )
Save the dataframe of the output of running FindMarkers
to get significantly differentially-expressed genes in your celltype to
an R object (.rds) for easy re-loading. Optionally also save the
significantly differentially-expressed gene symbols to a text file.
# save to rds
saveRDS(marker_genes.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept))
# save to text
write.table(x = filtered, file = sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept), sep = "\t",
quote = FALSE)
# specify another time points of interest
my_timept_2 <- "4mo"
my_timept_3 <- "6mo"
# Subset my celltype-specific Seurat object by times
my_seurat_2 <- subset(x = my_celltype_data, subset = Age == my_timept_2)
my_seurat_3 <- subset(x = my_celltype_data, subset = Age == my_timept_3)
nrow(my_seurat_2)
## [1] 40096
nrow(my_seurat_3)
## [1] 40096
# set active identities of my subset seurat to "Mt" attribute (denoting variant)
# for DE testing
Idents(my_seurat_2) <- "Mt"
Idents(my_seurat_3) <- "Mt"
summary(as.factor(my_seurat_2$Mt))
## V337M V337V
## 83 93
summary(as.factor(my_seurat_3$Mt))
## V337M V337V
## 120 206
# Run D.E. test to identify genes differentially expressed between
# between diseased and normal cells, for my celltype and at this timepoint
marker_genes_2.df <- FindMarkers(my_seurat_2,
ident.1 = "V337M", ident.2 = "V337V")
marker_genes_3.df <- FindMarkers(my_seurat_3,
ident.1 = "V337M", ident.2 = "V337V")
#save results
saveRDS(marker_genes_2.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept_2))
saveRDS(marker_genes_3.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept_3))
# order results by p-value
marker_genes_2.df <- marker_genes_2.df %>% arrange(p_val)
head(marker_genes_2.df)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## RPL9 2.436679e-06 -0.4309564 1.000 1.000 0.09770109
## CHCHD2 4.873454e-05 0.9258074 0.639 0.452 1.00000000
## RPS5 6.022260e-05 -0.3538973 1.000 1.000 1.00000000
## RPL24 6.876765e-05 -0.2768466 1.000 1.000 1.00000000
## RPS13 8.903810e-05 -0.3510350 1.000 1.000 1.00000000
## HSP90AB1 1.384171e-04 -0.3997375 1.000 1.000 1.00000000
marker_genes_3.df <- marker_genes_3.df %>% arrange(p_val)
head(marker_genes_3.df)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FAM183A 1.057416e-06 0.3126794 0.208 0.039 0.04239813
## RSPH1 6.416327e-06 0.5408231 0.292 0.107 0.25726906
## SDR39U1 8.923202e-06 0.2776002 0.367 0.146 0.35778471
## CHCHD2 7.523072e-05 0.5406251 0.742 0.534 1.00000000
## GSTO1 1.188225e-04 0.2758435 0.483 0.277 1.00000000
## KIF5B 2.266152e-04 -0.2973771 0.917 0.961 1.00000000
# format a new dataframe using the contents of
# marker_genes.df, to pass to volcanoPlot()
plot_2.df <- marker_genes_2.df %>%
tibble::rownames_to_column("geneIDs") %>%
dplyr::select(geneIDs, avg_log2FC, p_val) %>%
mutate(logpval = -log10(p_val))
colnames(plot_2.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot_2 <- volcanoPlot(plot_2.df, fold_cutoff, pvalue_cutoff,
title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept_2))
plot_3.df <- marker_genes_3.df %>%
tibble::rownames_to_column("geneIDs") %>%
dplyr::select(geneIDs, avg_log2FC, p_val) %>%
mutate(logpval = -log10(p_val))
colnames(plot_3.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot_3 <- volcanoPlot(plot_3.df, fold_cutoff, pvalue_cutoff,
title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept_3))
# render plots individually as interactives
ggplotly(plot, tooltip="text")
ggplotly(plot_2, tooltip="text")
ggplotly(plot_3, tooltip="text")
plot_all_time <- ggarrange(plot, plot_2, plot_3,
labels = c("A", "B", "C"),
ncol = 2, nrow = 2, widths = c(1, 1), common.legend = TRUE)
plot_all_time
plot_a <- ggplotly(plot, tooltip="text")
plot_b <- ggplotly(plot_2, tooltip="text")
plot_c <- ggplotly(plot_3, tooltip="text")
fig <- subplot(plot_a, plot_b, plot_c, nrows = 2, margin = 0.05) %>%
layout(showlegend = F, title = sprintf('How expressed/repressed genes changes over time across %s', my_celltype), annotations = list(
list(x = 0.2 , y = 1, text = "2mo", showarrow = F, xref='paper', yref='paper'),
list(x = 0.75 , y = 1, text = "4mo", showarrow = F, xref='paper', yref='paper'),
list(x = 0.2 , y = 0.48, text = "6mo", showarrow = F, xref='paper', yref='paper')))
fig
# Helper function to
# filter results of FindMarkers() using dplyr
filterGenes <- function(DEres.df, fold_cutoff = 0.5,
pvalue_cutoff = 0.01) {
#filtering pipeline to reduce results dataframe
# to only the genes defined to be significant
# by the thresholds specified
filtered <- DEres.df %>% filter( abs(avg_log2FC) >= fold_cutoff &
p_val <= pvalue_cutoff )
return(filtered)
}
# change these thresholds if desired
fold_cutoff = 0.5
pvalue_cutoff = 0.01
# filter DE results for each time point
DEres2mo_filtered <- filterGenes(marker_genes.df, fold_cutoff, pvalue_cutoff)
DEres4mo_filtered <- filterGenes(marker_genes_2.df, fold_cutoff, pvalue_cutoff)
DEres6mo_filtered <- filterGenes(marker_genes_3.df, fold_cutoff, pvalue_cutoff)
head(DEres2mo_filtered)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## HMGN3 0.001051848 0.5064855 0.938 0.888 1
## PCLO 0.002314014 -0.5419183 0.198 0.398 1
## PTGDS 0.004890413 -1.0792829 0.185 0.367 1
## MT-RNR2 0.005921988 -0.5203609 1.000 1.000 1
## KIDINS220 0.006594584 -0.5072945 0.481 0.622 1
## RUNX1T1 0.007386023 -0.5372801 0.049 0.173 1
Gene count at 2 months before filtering:
nrow(marker_genes.df)
## [1] 334
Gene count at 2 months after filtering:
nrow(DEres2mo_filtered)
## [1] 8
We generate a dot plot to help summarize which
differentially-expressed genes are represented at each timepoint, using
the same significance criteria as shown previously on the volcano plots.
Here we plot only significant genes, and create a factor
DE_status denoting whether the gene is significantly
upregulated or downregulated. DE_status is then used to
color the data points. The magnitude of differential expression for each
gene is denoted by its avg_log2FC on the y-axis.
# create columns for gene id
DEres2mo_filtered<- DEres2mo_filtered %>% tibble::rownames_to_column("gene")
DEres4mo_filtered<- DEres4mo_filtered %>% tibble::rownames_to_column("gene")
DEres6mo_filtered<- DEres6mo_filtered %>% tibble::rownames_to_column("gene")
# create columns for timepoint
DEres2mo_filtered$timepoint <- rep(2,nrow(DEres2mo_filtered))
DEres4mo_filtered$timepoint <- rep(4,nrow(DEres4mo_filtered))
DEres6mo_filtered$timepoint <- rep(6,nrow(DEres6mo_filtered))
# vertically combine or "stack" the dataframes
topDE_genes.df <- dplyr::bind_rows(DEres2mo_filtered,
DEres4mo_filtered,
DEres6mo_filtered)
# create column for DE status
topDE_genes.df <- topDE_genes.df %>% mutate( DE_status =
ifelse(avg_log2FC < -(fold_cutoff) &
p_val < pvalue_cutoff,
"Significant, downregulated",
ifelse(avg_log2FC > fold_cutoff &
p_val < pvalue_cutoff,
"Significant, upregulated", "Normally expressed")))
# factorize relevant columns
topDE_genes.df$timepoint <- as.factor(topDE_genes.df$timepoint)
topDE_genes.df$DE_status <- as.factor(topDE_genes.df$DE_status)
dim(topDE_genes.df)
## [1] 25 8
p <- ggplot(topDE_genes.df, aes(x = timepoint, y = avg_log2FC))+
geom_point(aes(color = DE_status), size = 2) +
geom_text_repel(label = topDE_genes.df$gene,
max.overlaps = getOption("ggrepel.max.overlaps",
default = 20)) +
labs(title = sprintf("Genes Differentially Expressed between Mt and Wt in %s Cells",
my_celltype),
x = "Developmental timepoint (months)") +
theme_minimal()
p
We can use Gene Ontology Enrichment to highlight enriched biological functions of the genes identified to be significant, separately for each timepoint.
# Define function to perform the GO enrichment analysis
# Summarize most prominant biological functions
# of input genes in humans
getGOTerms <- function(genes) {
require(gprofiler2)
## Step 1: Run GO analysis on gene list
# Results from the query are stored in GOres$result
GOres <- gprofiler2::gost(query = c(genes),
organism = "hsapiens", ordered_query = TRUE,
multi_query = FALSE, significant = TRUE, user_threshold = 0.05,
source = c("GO:BP"))
# Make a dataframe of major Gene Ontology terms
# and their p-values
# Note we order the results by increasing p-value
# (most significant come first)
if(!is.null(GOres)){
GOres.df <- GOres$result %>%
dplyr::select(term_name,term_id,p_value) %>%
arrange(p_value)
} else { GOres.df <- NULL }
return(GOres.df)
}
# Optionally re-define significance thresholds
fold_cutoff <- 0.3
pvalue_cutoff <- 0.01
# Create input gene lists
DEgenes2<- marker_genes.df %>% filter(abs(avg_log2FC) > fold_cutoff &
p_val < pvalue_cutoff)
DEgenes4<- marker_genes_2.df %>% filter(abs(avg_log2FC) > fold_cutoff &
p_val < pvalue_cutoff)
DEgenes6<- marker_genes_3.df %>% filter(abs(avg_log2FC)> fold_cutoff &
p_val < pvalue_cutoff)
genes2 <- rownames(DEgenes2)
genes4 <- rownames(DEgenes4)
genes6 <- rownames(DEgenes6)
gene_lists <- list(genes2,genes4,genes6)
# remove any empty gene sets
#(sets with no significant genes found)
gene_lists <- gene_lists[!sapply(gene_lists,is.null)]
# `goresults` will be a list of 3 dataframes,
# each one containing the GO output for one timepoint.
# Results data frames will be NULL if Gene Ontology
# does not find significant terms for those genes.
goresults <- lapply(X = gene_lists,
FUN = function(gl) {
terms <- getGOTerms(gl)
return(terms)} )
## Gene Ontology found significantly-enriched biological
## processes for timepoints: 4 months
| term_name | term_id | p_value |
|---|---|---|
| cytoplasmic translation | GO:0002181 | 0.0018601 |
| trachea cartilage morphogenesis | GO:0060535 | 0.0091736 |
| developmental growth involved in morphogenesis | GO:0060560 | 0.0322284 |
ggplot2######################################################
# Volcano Plot using ggplot2
######################################################
## set thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
# Prepare plot data
my_marker_genes.df <- marker_genes.df %>%
dplyr::select(avg_log2FC, p_val) %>%
mutate(diffexpressed = ifelse(avg_log2FC < -(fold_cutoff) & p_val <= pvalue_cutoff, "Significant, downregulated",
ifelse(avg_log2FC > fold_cutoff & p_val <= pvalue_cutoff, "Significant, upregulated", "Normally expressed")))
# convert rownames to a column of the dataframe called "gene"
my_marker_genes.df <- cbind(gene = rownames(marker_genes.df), my_marker_genes.df)
rownames(my_marker_genes.df) <- 1:nrow(my_marker_genes.df)
# create an empty "label" column
my_marker_genes.df$label <- NA
# Transform "label" column into a list of the genes to plot labels for
my_marker_genes.df$label[my_marker_genes.df$diffexpressed != "Normally expressed"] <- my_marker_genes.df$gene[my_marker_genes.df$diffexpressed != "Normally expressed"]
# Plot Volcano plots using ggplot
ggplot(data=my_marker_genes.df, aes(x=avg_log2FC, y=-log10(p_val), col=diffexpressed, label=label)) +
geom_point() +
theme_minimal() +
geom_text_repel() +
scale_color_manual(values=c("black", "blue", "red")) +
geom_vline(xintercept=c(-(fold_cutoff), fold_cutoff), col="red") +
geom_hline(yintercept=-log10(pvalue_cutoff), col="red") +
ggtitle("Volcano Plot of Gene Expression")