Overview

Differential expression testing of sets of genes studied in a sequencing experiment is a core component of any single-cell RNAseq analysis workflow. This type of analysis refers to taking the normalized read count data from a sequencing experiment and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, D.E. testing can be used to discern the differences in gene expression between mutant and wildtype cells, in order to study disease-specific gene expression patterns.

This notebook demonstrates how to implement D.E. analysis in two separate modules, both of which achieve the same statistical objective. We also show how to implement several useful visualizations for exploring expression patterns in the results of D.E..

Part I: Manual analysis implementing D.E. with base R functions and statistical testing. Uses an example dataset.
Part II: How to perform an equivalent gene differential expression testing analysis as shown in the previous base-R example workflow, except using Seurat differential expression testing functionalities. We implement this section using the celltype-specific subsets of the complete NSCI dataset.
Part III: Supplementary visualization module for analyzing results of D.E. testing.

Part I: Manual Differential Expression Testing

Gene expression analysis of histone deacetylase 1 (HDAC1) knockout mouse
Affymetrix microarray
Dataset: GSE5583 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5583)
Paper: Mol Cell Biol 2006 Nov;26(21):7913-28.
PMID: 16940178 (http://www.ncbi.nlm.nih.gov/pubmed/16940178)

Load GSE5583 data

# Read the data into R
dset<-read.table("~/AlzheimersDS/dev-notebooks/GSE5583-data.txt", header = TRUE, sep = "", dec = ".")
data<- as.matrix(dset[,2:7])
geneIDs <- dset[,1]
# Check the loaded dataset
# Dimension of the dataset
dim(data)

## [1] 12488     6

# First few rows
head(data)

##      WT.GSM130365 WT.GSM130366 WT.GSM130367 KO.GSM130368 KO.GSM130369
## [1,]         11.5          5.6         69.1         15.7         36.0
## [2,]         20.5         32.4         93.3         31.8         14.4
## [3,]         72.4         89.0         79.2         80.5        130.1
## [4,]        261.0        226.2        365.1        432.0        447.3
## [5,]       1086.2       1555.6       1487.1       1062.2       1365.9
## [6,]         49.7         52.9         15.0         25.8         48.8
##      KO.GSM130370
## [1,]         42.0
## [2,]         22.9
## [3,]         86.7
## [4,]        288.1
## [5,]       1436.2
## [6,]         54.8

Exploratory data analysis

# Check the behavior of the data
# First reformat data into long format for plotting
plotdata <- as.data.frame(data) %>%  
  tidyr::gather(key = "cell_line", value = "expression_val")
# Plot histogram
hist1 <- ggplot(plotdata, aes(x=expression_val)) +
    geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Expression Level Distribution - Raw Data") +
   labs(x= "Expression level", y = "Count") +
  xlim(0,3000)+
    theme_minimal() 
hist1

# Check the behavior of the data after log-transformation
data2 = log2(data)
# First reformat data into long format for plotting
plotdata <- as.data.frame(data2) %>%  
  tidyr::gather(key = "cell_line", value = "expression_val")
# Plot histogram
hist2 <- ggplot(plotdata, aes(x=expression_val)) +
    geom_histogram( binwidth=1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Expression Level Distribution - Normalized Data") +
   labs(x= "Log-normalized expression level", y = "Count") +
    theme_minimal() 
hist2

Hierarchical clustering of the “samples” based on the correlation coefficients of the expression values

hc = hclust(as.dist(1-cor(data2)))
plot(hc, main="GSE5583 - Hierarchical Clustering")

Create a dataframe to store D.E. results

DEres.df <- as.data.frame(data2)
DEres.df <- cbind.data.frame(geneIDs, DEres.df)

Run differential expression (DE) analysis

# Separate the two conditions into two smaller data frames
wt = data2[,1:3]
ko = data2[,4:6]
# Compute the means of the samples of each condition
wt.mean = apply(wt, 1, mean)
ko.mean = apply(ko, 1, mean)
# Just get the maximum of all the means for plotting
limit = max(wt.mean, ko.mean)
means <- cbind.data.frame(wt.mean,ko.mean)
# Scatter plot
genescatter <- ggplot(means, aes(x=wt.mean,y=ko.mean)) + 
                        geom_point(alpha= 0.5, color = "azure4") +
                         # draw linear trend line
                        geom_smooth(method=lm , color="red", se=FALSE) +
                        xlab("WT") + 
                        ylab("KO") +
     ggtitle("GSE5583 - Scatter") +
     xlim(0, limit) + ylim(0, limit)
                        
genescatter

## `geom_smooth()` using formula 'y ~ x'

Compute fold-change difference between the means of the conditions

#compute fold
fold = wt.mean - ko.mean
#add to results dataframe
DEres.df$logFC <- fold
# Generate histogram of the fold differences
foldhist <- ggplot(DEres.df, aes(x=logFC)) +
    geom_histogram( binwidth=0.5, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Fold Change Distribution") +
   labs(x= "Log fold change in expression level between groups", y = "Count") +
    theme_minimal() 
foldhist

Compute statistical significance (using t-test)

pvalue = NULL # Empty list for the p-values
tstat = NULL # Empty list of the t test statistics
for(i in 1 : nrow(data)) { # For each gene : 
  x = wt[i,] # WT of gene number i
  y = ko[i,] # KO of gene number i
  
  # Compute t-test between the two conditions
  t = t.test(x, y)
  
  # Put the current p-value in the pvalues list
  pvalue[i] = t$p.value
  # Put the current t-statistic in the tstats list
  tstat[i] = t$statistic
}
# add results to results dataframe
DEres.df$pvalue <- pvalue

# Generate -log10(pvalues)
pvallogged <- -log10(pvalue)
# add results to results dataframe
DEres.df$logpval <- pvallogged
# Generate histogram of -log10(pvalues)
pval_hist <- ggplot(DEres.df, aes(x=logpval)) +
    geom_histogram( binwidth=0.25, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("-log10(pvalue) Distribution") +
   labs(x= "Significance levels occuring in D.E. results", y = "Count") +
  xlim(0,4) +
    theme_minimal() 
pval_hist

Generate Volcano Plots

`volcanoPlot`: Function to generate the volcano plot highlighting significantly differentially-expressed genes using `ggplot2`

This function takes as input a results dataframe containing columns of the gene IDs, fold change values and the corresponding p-values for gene expression significance from a previous DE analysis. It also expects two user-specified thresholds pvalue_cutoff and fold_cutoff to render significance cutoffs on the volcano plot visualization. These latter values default to 0.01 and 0.5, respectively. You can optionally also specify a plot title.

This function outputs a ggplot2 object that can then be further refined with additional ggplot2 aesthetics.

# Volcano Plots:
# put the biological significance (fold changes)
# and statistical significance (p-value) in one plot
# Generate the volcano plot
volcanoPlot <- function(results.df, fold_cutoff=0.5,
                        pvalue_cutoff=0.01, title = 
                          "Volcano Plot of Gene Expression"){
  # Inputs: ##############################################
  
  # results.df: dataframe containing columns: 
  #             1) "geneIDs": gene IDs
  #             2) "logFC": foldchange values from DE analysis
  #             3) "logpval": log-transformed p-values from DE analysis
  # fold cutoff: significance threshold to render visually on plot;
  #               denotes fold difference between mutant and wildtype
  #               also referred to as "biologial signal"
  # p_value_cutoff: significance threshold to render visually on plot;
  #                 denotes statistical significance
  ########################################################
  # create factor denoting differential expression status
  stats.df <- results.df %>% mutate(diffexpressed = 
                                  ifelse(logFC < -(fold_cutoff) & 
                                  logpval >= -log10(pvalue_cutoff),
                                   "Significant, downregulated", 
                                ifelse(logFC > fold_cutoff & 
                                   logpval >= -log10(pvalue_cutoff), 
                                   "Significant, upregulated", "Normally expressed")))
  
  stats.df$diffexpressed <- as.factor(stats.df$diffexpressed)
  # Base plot
  plot <- ggplot(stats.df, aes(x = logFC, y = logpval, col = diffexpressed, text = geneIDs)) +
                 geom_point() +
                 ggtitle(title) + 
                 ylab("-log10(pvalue)") +
                 geom_vline(xintercept=c(-fold_cutoff, fold_cutoff), col="red") +
                 geom_hline(yintercept=-log10(pvalue_cutoff), col="darkturquoise") +
                 scale_color_manual(values=c("black", "blue", "red"), 
                                    name = "Differential expression\nstatus") +
                 theme_minimal()
  return(plot)
  
}

# Volcano: put the biological significance (fold-change)
# and statistical significance (p-value) in one plot
plot <- volcanoPlot(DEres.df, fold_cutoff=0.5, pvalue_cutoff=0.01)
ggplotly(plot, tooltip="text")

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

How to screen for the genes that satisfy your filtering criteria

Once the differentially-expressed genes between your conditions of interest have been identified, you may want to save them to a convenient location so they can be easily read in for subsequent analyses.

# Filter genes using dplyr
# Define thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
#filtering pipeline
filtered <- DEres.df %>% filter( abs(logFC) >= fold_cutoff & 
                                   pvalue <= pvalue_cutoff )

Show original versus filtered dimensions:

dim(DEres.df)

## [1] 12488    10

dim(filtered)

## [1] 318  10

Now we can save the dataframe of significantly differentially-expressed genes to an R object (.rds) for easy re-loading. Optionally also save the significantly differentially-expressed gene symbols to a text file.

# save to rds
#saveRDS(filtered, "GSE5583_DE.rds")
# save to text
#write.table(filtered, "GSE5583_DE.txt", sep = "\t", quote = FALSE)

How to make heatmaps of gene expression

We use our filtered DE results from our example GSE5883 dataset to generate heatmaps by two different R methods.

head(DEres.df)

##     geneIDs WT.GSM130365 WT.GSM130366 WT.GSM130367 KO.GSM130368 KO.GSM130369
## 1 100001_at     3.523562     2.485427     6.110614     3.972693     5.169925
## 2 100002_at     4.357552     5.017922     6.543805     4.990955     3.847997
## 3 100003_at     6.177918     6.475733     6.307429     6.330917     7.023477
## 4 100004_at     8.027906     7.821455     8.512148     8.754888     8.805099
## 5 100005_at    10.085074    10.603255    10.538286    10.052840    10.415636
## 6 100006_at     5.635174     5.725196     3.906891     4.689299     5.608809
##   KO.GSM130370       logFC    pvalue   logpval
## 1     5.392317 -0.80511083 0.5449730 0.2636250
## 2     4.517276  0.85435054 0.3253745 0.4876165
## 3     6.437960 -0.27709146 0.3287830 0.4830906
## 4     8.170426 -0.45630111 0.1892376 0.7229925
## 5    10.488041  0.09003287 0.6928410 0.1593664
## 6     5.776104 -0.26898401 0.7180077 0.1438709

Prep data

#select only numeric columns for generating heatmap
filtered <- filtered %>% dplyr::select(2:7)
# coerce df to matrix
filtered <- as.matrix(filtered)

Cluster the rows (genes) & columns (samples) by correlation

rowv = as.dendrogram(hclust(as.dist(1-cor(t(filtered)))))
colv = as.dendrogram(hclust(as.dist(1-cor(filtered))))

Generate a heatmap

# base heatmap
heatmap(filtered, Rowv=rowv, Colv=colv, cexCol=0.7)

# Enhanced heatmap
heatmap.2(filtered, Rowv=rowv, Colv=colv, cexCol=0.7,
          col = rev(redblue(256)), scale = "row")

Part II: Automated Differential Expression Testing with `Seurat`

This section demonstrates how to perform an equivalent gene differential expression testing analysis as shown in the previous base-R example workflow, except using Seurat built-in differential expression testing functionalities. In Seurat, differential expression testing between two populations in a dataset is implemented with the FindMarkers() function. As a default, Seurat performs differential expression based on the non-parametric Wilcoxon rank sum test.

Load NSCI data from your celltype-specific `Seurat` object

Replace the code Ast (for “astrocytes”) below with the code for your assigned cell type. Optionally, also replace the experimental time point at which you want to analyze differential expression.

Data specifications:

# specify your cell type 
my_celltype <- "Ast"
# specify time point of interest
my_timept <- "2mo"

Now, load in the data from your celltype-specific Seurat object, which should be saved in a file in your student-notebooks directory with a filename in the format "FinalMergedData-my_celltype.rds"(where my-celltype is your celltype code). Change the filepath in the code below to match the location of the file in your workspace, wherever you have it saved. For a reminder of how this file should have been generated, see your copy of Notebook4-Data-Access-and-Extraction.Rmd, description under the section “Day 3 Take Home Assignment”*.

# Load Seurat data
# Change the filepath in quotes "" below to match where you 
# have your celltype data saved
# The expected filepath is "~/AlzheimersDS/student-notebooks/FinalMergedData-my_celltype.rds"
# (where `my-celltype` is your celltype code)

my_celltype_data <- readRDS("/data/Alzheimers_DS/FinalMergedData-Ast.rds")

Results derived from D.E. will be most meaningful if we separate our celltype data by the experimental timepoint at which it was collected. The code below will make a subset of your celltype-specific Seurat object by time, creating an object containing only the expression data for cells of your cell type, at the specified timepoint.

# Subset my celltype-specific Seurat object by time
my_seurat <- subset(x = my_celltype_data, subset = Age == my_timept)

Get count of genes to be tested:

nrow(my_seurat)

## [1] 40096

Run differential expression (DE) analysis

Find differentially-expressed genes between the conditions "V337M" (mutant) versus "V337V" (wildtype) for this celltype. Perform the DE analysis using the default recommended test (Wilcoxon rank sum).

Recall that by default, this Seurat function will draw on the normalized expression values in the Seurat object slot data. See Notebook 2 for further explanation of the different data slots in Seurat objects.

# set active identities of my subset seurat to "Mt" attribute (denoting variant)
# for DE testing
Idents(my_seurat) <- "Mt"
summary(as.factor(my_seurat$Mt))

## V337M V337V 
##    81    98

# Run D.E. test to identify genes differentially expressed between
# between diseased and normal cells, for my celltype and at this timepoint
marker_genes.df <- FindMarkers(my_seurat, 
                               ident.1 = "V337M", ident.2 = "V337V") 
#save results
saveRDS(marker_genes.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept))

Inspect results

# order results by p-value
marker_genes.df <- marker_genes.df %>% arrange(p_val) 
head(marker_genes.df)

##                 p_val avg_log2FC pct.1 pct.2 p_val_adj
## HMGXB4   2.964098e-05 -0.4700985 0.247 0.541         1
## SNRPGP10 6.754363e-05  0.3308939 0.222 0.031         1
## TTBK2    1.044893e-04 -0.3491722 0.049 0.265         1
## RAB28    1.456217e-04  0.3951541 0.383 0.143         1
## HIRIP3   3.108470e-04 -0.2753372 0.037 0.224         1
## DOK5     3.644166e-04  0.4949083 0.617 0.357         1

Plot results distributions

# Plot pvalues
marker_genes_hist <- ggplot(marker_genes.df, aes(x=p_val)) +
    geom_histogram( binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle(sprintf("P-Value Distribution for D.E. Genes in %s at Timepoint %s",my_celltype,my_timept)) +
   labs(x= "P Value", y = "Count") +
    theme_minimal() 
marker_genes_hist

# Plot log fold changes in expression from mt to normal
marker_genes_hist2 <- ggplot(marker_genes.df, aes(x=avg_log2FC)) +
    geom_histogram( binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle(sprintf("LogFC Distribution for D.E. Genes in %s at Timepoint %s",my_celltype,my_timept)) +
   labs(x= "logFC", y = "Count") +
    theme_minimal() 
marker_genes_hist2

Note on interpretation

In this context, positive log fold change values indicate that the gene is more highly expressed in mutant cells relative to normal, and vice versa for negative values.

Generate a Volcano Plot of Gene Expression

Set significance thresholds and generate volcano plot

# set thresholds 
fold_cutoff = 0.5
pvalue_cutoff = 0.01
# format a new dataframe using the contents of
# marker_genes.df, to pass to volcanoPlot()"
plot.df <- marker_genes.df %>% 
  tibble::rownames_to_column("geneIDs") %>% 
  dplyr::select(geneIDs, avg_log2FC, p_val) %>% 
  mutate(logpval = -log10(p_val))
colnames(plot.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot <- volcanoPlot(plot.df, fold_cutoff, pvalue_cutoff, 
            title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept))
ggplotly(plot, tooltip="text")

Filter your D.E. genes for this timepoint and save to external files for easy re-loading

# Filter output of FindMarkers() genes using dplyr
# Define thresholds
fold_cutoff = 0.5
pvalue_cutoff = 0.01
#filtering pipeline
filtered <- DEres.df %>% filter( abs(logFC) >= fold_cutoff & 
                                   pvalue <= pvalue_cutoff )

Save the dataframe of the output of running FindMarkers to get significantly differentially-expressed genes in your celltype to an R object (.rds) for easy re-loading. Optionally also save the significantly differentially-expressed gene symbols to a text file.

# save to rds
saveRDS(marker_genes.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept))
# save to text
write.table(x = filtered, file = sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept), sep = "\t",
             quote = FALSE)

How the pattern of significantly expressed/repressed genes changes over time across cell types?

# specify another time points of interest
my_timept_2 <- "4mo"
my_timept_3 <- "6mo"

# Subset my celltype-specific Seurat object by times
my_seurat_2 <- subset(x = my_celltype_data, subset = Age == my_timept_2)
my_seurat_3 <- subset(x = my_celltype_data, subset = Age == my_timept_3)

Get count of genes to be tested:

nrow(my_seurat_2)

## [1] 40096

nrow(my_seurat_3)

## [1] 40096

Run differential expression (DE) analysis

# set active identities of my subset seurat to "Mt" attribute (denoting variant)
# for DE testing
Idents(my_seurat_2) <- "Mt"
Idents(my_seurat_3) <- "Mt"
summary(as.factor(my_seurat_2$Mt))

## V337M V337V 
##    83    93

summary(as.factor(my_seurat_3$Mt))

## V337M V337V 
##   120   206

# Run D.E. test to identify genes differentially expressed between
# between diseased and normal cells, for my celltype and at this timepoint
marker_genes_2.df <- FindMarkers(my_seurat_2, 
                               ident.1 = "V337M", ident.2 = "V337V") 
marker_genes_3.df <- FindMarkers(my_seurat_3, 
                               ident.1 = "V337M", ident.2 = "V337V") 
#save results
saveRDS(marker_genes_2.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept_2))
saveRDS(marker_genes_3.df, sprintf("DE_genes_MtvsWt_%s_%s.rds", my_celltype,my_timept_3))

Inspect results

# order results by p-value
marker_genes_2.df <- marker_genes_2.df %>% arrange(p_val) 
head(marker_genes_2.df)

##                 p_val avg_log2FC pct.1 pct.2  p_val_adj
## RPL9     2.436679e-06 -0.4309564 1.000 1.000 0.09770109
## CHCHD2   4.873454e-05  0.9258074 0.639 0.452 1.00000000
## RPS5     6.022260e-05 -0.3538973 1.000 1.000 1.00000000
## RPL24    6.876765e-05 -0.2768466 1.000 1.000 1.00000000
## RPS13    8.903810e-05 -0.3510350 1.000 1.000 1.00000000
## HSP90AB1 1.384171e-04 -0.3997375 1.000 1.000 1.00000000

marker_genes_3.df <- marker_genes_3.df %>% arrange(p_val) 
head(marker_genes_3.df)

##                p_val avg_log2FC pct.1 pct.2  p_val_adj
## FAM183A 1.057416e-06  0.3126794 0.208 0.039 0.04239813
## RSPH1   6.416327e-06  0.5408231 0.292 0.107 0.25726906
## SDR39U1 8.923202e-06  0.2776002 0.367 0.146 0.35778471
## CHCHD2  7.523072e-05  0.5406251 0.742 0.534 1.00000000
## GSTO1   1.188225e-04  0.2758435 0.483 0.277 1.00000000
## KIF5B   2.266152e-04 -0.2973771 0.917 0.961 1.00000000

# format a new dataframe using the contents of
# marker_genes.df, to pass to volcanoPlot()
plot_2.df <- marker_genes_2.df %>% 
  tibble::rownames_to_column("geneIDs") %>% 
  dplyr::select(geneIDs, avg_log2FC, p_val) %>% 
  mutate(logpval = -log10(p_val))
colnames(plot_2.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot_2 <- volcanoPlot(plot_2.df, fold_cutoff, pvalue_cutoff, 
            title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept_2))

plot_3.df <- marker_genes_3.df %>% 
  tibble::rownames_to_column("geneIDs") %>% 
  dplyr::select(geneIDs, avg_log2FC, p_val) %>% 
  mutate(logpval = -log10(p_val))
colnames(plot_3.df) <- c("geneIDs","logFC","pvalue","logpval")
# generate the plot
plot_3 <- volcanoPlot(plot_3.df, fold_cutoff, pvalue_cutoff, 
            title = sprintf("Gene Expression in %s Cells (%s)",my_celltype,my_timept_3))

Generate Volcano Plots for each timepoint

# render plots individually as interactives
ggplotly(plot, tooltip="text")

ggplotly(plot_2, tooltip="text")

ggplotly(plot_3, tooltip="text")

Show combined plots for comparison

plot_all_time <- ggarrange(plot, plot_2, plot_3,
                    labels = c("A", "B", "C"),
                    ncol = 2, nrow = 2, widths = c(1, 1), common.legend = TRUE)
plot_all_time

plot_a <- ggplotly(plot, tooltip="text")
plot_b <- ggplotly(plot_2, tooltip="text")
plot_c <- ggplotly(plot_3, tooltip="text")

fig <- subplot(plot_a, plot_b, plot_c,  nrows = 2, margin = 0.05) %>% 
  layout(showlegend = F, title = sprintf('How expressed/repressed genes changes over time across %s', my_celltype), annotations = list( 
    list(x = 0.2 , y = 1, text = "2mo", showarrow = F, xref='paper', yref='paper'), 
    list(x = 0.75 , y = 1, text = "4mo", showarrow = F, xref='paper', yref='paper'),
    list(x = 0.2 , y = 0.48, text = "6mo", showarrow = F, xref='paper', yref='paper')))
fig

Filter your D.E. genes at each timepoint using your desired significance thresholds

# Helper function to 
# filter results of FindMarkers() using dplyr

filterGenes <- function(DEres.df, fold_cutoff = 0.5, 
                                  pvalue_cutoff = 0.01) {
  #filtering pipeline to reduce results dataframe
  # to only the genes defined to be significant
  # by the thresholds specified 
  
  filtered <- DEres.df %>% filter( abs(avg_log2FC) >= fold_cutoff & 
                                     p_val <= pvalue_cutoff )
  return(filtered)
}


# change these thresholds if desired
fold_cutoff = 0.5
pvalue_cutoff = 0.01

# filter DE results for each time point
DEres2mo_filtered <- filterGenes(marker_genes.df, fold_cutoff, pvalue_cutoff)
DEres4mo_filtered <- filterGenes(marker_genes_2.df, fold_cutoff, pvalue_cutoff)
DEres6mo_filtered <- filterGenes(marker_genes_3.df, fold_cutoff, pvalue_cutoff)

head(DEres2mo_filtered)

##                 p_val avg_log2FC pct.1 pct.2 p_val_adj
## HMGN3     0.001051848  0.5064855 0.938 0.888         1
## PCLO      0.002314014 -0.5419183 0.198 0.398         1
## PTGDS     0.004890413 -1.0792829 0.185 0.367         1
## MT-RNR2   0.005921988 -0.5203609 1.000 1.000         1
## KIDINS220 0.006594584 -0.5072945 0.481 0.622         1
## RUNX1T1   0.007386023 -0.5372801 0.049 0.173         1

Gene count at 2 months before filtering:

nrow(marker_genes.df)

## [1] 334

Gene count at 2 months after filtering:

nrow(DEres2mo_filtered)

## [1] 8

Visualize differential expression over time for the top most-variable genes

We generate a dot plot to help summarize which differentially-expressed genes are represented at each timepoint, using the same significance criteria as shown previously on the volcano plots. Here we plot only significant genes, and create a factor DE_status denoting whether the gene is significantly upregulated or downregulated. DE_status is then used to color the data points. The magnitude of differential expression for each gene is denoted by its avg_log2FC on the y-axis.

# create columns for gene id
DEres2mo_filtered<- DEres2mo_filtered %>% tibble::rownames_to_column("gene")
DEres4mo_filtered<- DEres4mo_filtered %>% tibble::rownames_to_column("gene")
DEres6mo_filtered<- DEres6mo_filtered %>% tibble::rownames_to_column("gene")

# create columns for timepoint
DEres2mo_filtered$timepoint <- rep(2,nrow(DEres2mo_filtered))
DEres4mo_filtered$timepoint <- rep(4,nrow(DEres4mo_filtered))
DEres6mo_filtered$timepoint <- rep(6,nrow(DEres6mo_filtered))

# vertically combine or "stack" the dataframes
topDE_genes.df <- dplyr::bind_rows(DEres2mo_filtered, 
                                   DEres4mo_filtered,
                                   DEres6mo_filtered)

# create column for DE status
topDE_genes.df <- topDE_genes.df %>% mutate( DE_status = 
                                  ifelse(avg_log2FC < -(fold_cutoff) & 
                                  p_val < pvalue_cutoff,
                                   "Significant, downregulated", 
                                ifelse(avg_log2FC > fold_cutoff & 
                                   p_val < pvalue_cutoff, 
                                   "Significant, upregulated", "Normally expressed")))
# factorize relevant columns
topDE_genes.df$timepoint <- as.factor(topDE_genes.df$timepoint)
topDE_genes.df$DE_status <- as.factor(topDE_genes.df$DE_status)

dim(topDE_genes.df)

## [1] 25  8

p <- ggplot(topDE_genes.df, aes(x = timepoint, y = avg_log2FC))+
  geom_point(aes(color = DE_status), size = 2) +
  geom_text_repel(label = topDE_genes.df$gene,
                  max.overlaps = getOption("ggrepel.max.overlaps", 
                                                     default = 20)) +
  labs(title = sprintf("Genes Differentially Expressed between Mt and Wt in %s Cells",
                                                                  my_celltype),
                 x = "Developmental timepoint (months)") +
  theme_minimal()

p

Summarize biological relevance of D.E. genes at each timepoint

We can use Gene Ontology Enrichment to highlight enriched biological functions of the genes identified to be significant, separately for each timepoint.

Perform Gene Ontology enrichment analysis on the D.E. genes at each timepoint

# Define function to perform the GO enrichment analysis
# Summarize most prominant biological functions 
# of input genes in humans
getGOTerms <- function(genes) {
  require(gprofiler2)
  ## Step 1: Run GO analysis on gene list
  # Results from the query are stored in GOres$result
  GOres <- gprofiler2::gost(query = c(genes),
  organism = "hsapiens", ordered_query = TRUE,
  multi_query = FALSE, significant = TRUE, user_threshold = 0.05, 
  source = c("GO:BP"))
  
  # Make a dataframe of major Gene Ontology terms 
  # and their p-values
  # Note we order the results by increasing p-value 
  # (most significant come first)
  if(!is.null(GOres)){
    GOres.df <- GOres$result %>%
    dplyr::select(term_name,term_id,p_value) %>%
    arrange(p_value)
  
  } else { GOres.df <- NULL }
  return(GOres.df)
}

Create input gene lists:

# Optionally re-define significance thresholds
fold_cutoff <- 0.3
pvalue_cutoff <- 0.01
  
# Create input gene lists
DEgenes2<- marker_genes.df %>% filter(abs(avg_log2FC) > fold_cutoff & 
                                              p_val < pvalue_cutoff) 
DEgenes4<- marker_genes_2.df %>% filter(abs(avg_log2FC) > fold_cutoff & 
                                              p_val < pvalue_cutoff) 
DEgenes6<- marker_genes_3.df %>% filter(abs(avg_log2FC)> fold_cutoff & 
                                              p_val < pvalue_cutoff) 
genes2 <- rownames(DEgenes2)
genes4 <- rownames(DEgenes4)
genes6 <- rownames(DEgenes6)
gene_lists <- list(genes2,genes4,genes6)
# remove any empty gene sets  
#(sets with no significant genes found)
gene_lists <- gene_lists[!sapply(gene_lists,is.null)]

Input genes to GO pipeline and show results (if found)

# `goresults` will be a list of 3 dataframes, 
# each one containing the GO output for one timepoint.
# Results data frames will be NULL if Gene Ontology 
# does not find significant terms for those genes.
goresults <- lapply(X = gene_lists,
                         FUN = function(gl) {
                         terms <- getGOTerms(gl)
                         return(terms)} )

## Gene Ontology found significantly-enriched biological 
##       processes for timepoints:  4 months

Enriched GO Terms, 4 months
term_name	term_id	p_value
cytoplasmic translation	GO:0002181	0.0018601
trachea cartilage morphogenesis	GO:0060535	0.0091736
developmental growth involved in morphogenesis	GO:0060560	0.0322284

Part III: How to make volcano plots using `ggplot2`

######################################################
# Volcano Plot using ggplot2
######################################################
## set thresholds 
fold_cutoff = 0.5
pvalue_cutoff = 0.01
# Prepare plot data
my_marker_genes.df <- marker_genes.df %>%
  dplyr::select(avg_log2FC, p_val) %>%
  mutate(diffexpressed = ifelse(avg_log2FC < -(fold_cutoff) & p_val <= pvalue_cutoff, "Significant, downregulated", 
                                ifelse(avg_log2FC > fold_cutoff & p_val <= pvalue_cutoff, "Significant, upregulated", "Normally expressed")))
# convert rownames to a column of the dataframe called "gene"
my_marker_genes.df <- cbind(gene = rownames(marker_genes.df), my_marker_genes.df)
rownames(my_marker_genes.df) <- 1:nrow(my_marker_genes.df)
# create an empty "label" column
my_marker_genes.df$label <- NA
# Transform "label" column into a list of the genes to plot labels for
my_marker_genes.df$label[my_marker_genes.df$diffexpressed != "Normally expressed"] <- my_marker_genes.df$gene[my_marker_genes.df$diffexpressed != "Normally expressed"]
# Plot Volcano plots using ggplot
ggplot(data=my_marker_genes.df, aes(x=avg_log2FC, y=-log10(p_val), col=diffexpressed, label=label)) +
        geom_point() + 
        theme_minimal() +
        geom_text_repel() +
        scale_color_manual(values=c("black", "blue", "red")) +
        geom_vline(xintercept=c(-(fold_cutoff), fold_cutoff), col="red") +
        geom_hline(yintercept=-log10(pvalue_cutoff), col="red") +
        ggtitle("Volcano Plot of Gene Expression")

D.E. Analysis: Identifying Differentially-Expressed Genes between Experimental Groups

IDEA Alzheimer’s Data Analysis Bootcamp - Summer 2022

Ahmed Moustafa, Haowen He, Rachael White

Overview

Part I: Manual Differential Expression Testing

Load GSE5583 data

Exploratory data analysis

Create a dataframe to store D.E. results

Run differential expression (DE) analysis

Compute fold-change difference between the means of the conditions

Compute statistical significance (using t-test)

Generate Volcano Plots

`volcanoPlot`: Function to generate the volcano plot highlighting significantly differentially-expressed genes using `ggplot2`

How to screen for the genes that satisfy your filtering criteria

How to make heatmaps of gene expression

Part II: Automated Differential Expression Testing with `Seurat`

Load NSCI data from your celltype-specific `Seurat` object

Data specifications:

Get count of genes to be tested:

Run differential expression (DE) analysis

Inspect results

Note on interpretation

Generate a Volcano Plot of Gene Expression

Filter your D.E. genes for this timepoint and save to external files for easy re-loading

How the pattern of significantly expressed/repressed genes changes over time across cell types?

Get count of genes to be tested:

Run differential expression (DE) analysis

Inspect results

Generate Volcano Plots for each timepoint

Show combined plots for comparison

Filter your D.E. genes at each timepoint using your desired significance thresholds

Visualize differential expression over time for the top most-variable genes

Summarize biological relevance of D.E. genes at each timepoint

Perform Gene Ontology enrichment analysis on the D.E. genes at each timepoint

Create input gene lists:

Input genes to GO pipeline and show results (if found)

Part III: How to make volcano plots using `ggplot2`

D.E. Analysis: Identifying Differentially-Expressed Genes between Experimental Groups

IDEA Alzheimer’s Data Analysis Bootcamp - Summer 2022

Ahmed Moustafa, Haowen He, Rachael White

Overview

Part I: Manual Differential Expression Testing

Load GSE5583 data

Exploratory data analysis

Create a dataframe to store D.E. results

Run differential expression (DE) analysis

Compute fold-change difference between the means of the conditions

Compute statistical significance (using t-test)

Generate Volcano Plots

volcanoPlot: Function to generate the volcano plot highlighting significantly differentially-expressed genes using ggplot2

How to screen for the genes that satisfy your filtering criteria

How to make heatmaps of gene expression

Part II: Automated Differential Expression Testing with Seurat

Load NSCI data from your celltype-specific Seurat object

Data specifications:

Get count of genes to be tested:

Run differential expression (DE) analysis

Inspect results

Note on interpretation

Generate a Volcano Plot of Gene Expression

Filter your D.E. genes for this timepoint and save to external files for easy re-loading

How the pattern of significantly expressed/repressed genes changes over time across cell types?

Get count of genes to be tested:

Run differential expression (DE) analysis

Inspect results

Generate Volcano Plots for each timepoint

Show combined plots for comparison

Filter your D.E. genes at each timepoint using your desired significance thresholds

Visualize differential expression over time for the top most-variable genes

Summarize biological relevance of D.E. genes at each timepoint

Perform Gene Ontology enrichment analysis on the D.E. genes at each timepoint

Create input gene lists:

Input genes to GO pipeline and show results (if found)

Part III: How to make volcano plots using ggplot2

`volcanoPlot`: Function to generate the volcano plot highlighting significantly differentially-expressed genes using `ggplot2`

Part II: Automated Differential Expression Testing with `Seurat`

Load NSCI data from your celltype-specific `Seurat` object

Part III: How to make volcano plots using `ggplot2`