Volcano Plot for differentially expressing genes

Author

Akash Mitra

Creating a volcano plot using only ggplot2 and tidyverse packages

About the data

The dataset is a processed transcriptomics data file in .csv format, comapring a control and a treated sample. It shows the fold change and the p-value of significance for each gene in the list (treated compared to the control). The dataset is available at this link DEG_data

Installing the required packages

Packages can be installed using the the command install.packages("package_name")

Following packages need to be installed:

  • install.packages("tidyverse")

  • install.packages("RColorBrewer")

  • install.packages("ggrepel")

Calling the package to the working environment

library(tidyverse)
library(RColorBrewer)
library(ggrepel)

Importing the data

Setting working directory

Before you can import the data, the working directory needs to be set. You can set the working directory by using the following command:

setwd("complete_path_of_folder")

This should be the folder/directory in which the CSV file is stored

Reading the .csv file

deg <- read_csv("DEG_data.csv")

Data cleaning

|> is the native pipe operator which when used transfers whatever is on the left of the pipe as the first argument in the function that is provided immediately after the pipe

deg_clean <- deg |> 
  select(2:5) |> 
  drop_na() |> 
  
  #adding a new column calculating log10(pvalue)
  mutate(negLog10pval = -log10(pval)) |> 
  
  #adding a new column where the default value will be NS, but "UPREGULATED" and "DOWNREGULATED" will be assigned if it meets the desired conditions
  mutate(diff_exp = case_when(log2fc > 0.6 & pval < 0.05 ~ "UPREGULATED",
                              log2fc < -0.6 & pval < 0.05 ~ "DOWNREGULATED",
                              .default = "NS")) |> 
  
  #changing the factor levels according to what is needed in the legend of the plot
  mutate(diff_exp = fct_relevel(diff_exp, "UPREGULATED", "DOWNREGULATED", "NS"))

#display the first 6 rows
head(deg_clean)
# A tibble: 6 × 6
  gene_symbol  pval  padj    log2fc negLog10pval diff_exp
  <chr>       <dbl> <dbl>     <dbl>        <dbl> <fct>   
1 MIR1302-2HG 1.00   1.00  1.59e+ 1   0.0000614  NS      
2 FAM138A     1.00   1.00 -2.84e-12   0.00000516 NS      
3 OR4F5       1.00   1.00 -2.84e-12   0.00000516 NS      
4 AL627309.1  0.701  1.00 -6.36e- 1   0.154      NS      
5 AL627309.3  1.00   1.00 -2.84e-12   0.00000516 NS      
6 AL627309.2  1.00   1.00 -2.84e-12   0.00000516 NS      

Declaring a list of genes to be annotated on the plot

This list can be prepared according to what genes need to be displayed on the plot.

prot <- c("CXCR5", "LINC00115", "FAM41C", "FBXO41", "TRIM54")  

Plotting the data using ggplot2 package

ggplot works in layers. So, after the ggplot() function, all the other functions following will be added using a ‘+’ symbol instead of the pipe operator |>

deg_clean |> 
  drop_na() |> 
  ggplot(aes(log2fc, -log10(pval), color = diff_exp))+
  
  #This layer shows all the points in the plot in the same colour
  geom_point(size = 2, alpha = 0.7, shape = 16)+
  
  #Thsi layer colours only the points specified in the 'prot' variable in the previous step
  geom_point(data = deg_clean |> 
               filter(gene_symbol %in% prot)
               , aes(log2fc, -log10(pval)), 
             color = "black", size = 3, shape = 21, fill = "#998ec3")+
  
  #drawing two vertical lines on the plot to denote upregulated and downregulated thresholds
  geom_vline(xintercept = c(0.6, -0.6), linetype = "dashed")+
  
  #drawing a horizontal line to denote the P-value cutoff
  geom_hline(yintercept = c(-log10(0.05)), linetype = "dashed")+
  scale_color_manual(values = c("#fc8d59", "#91bfdb", "grey70"),
                     labels = c("UPREGULATED", "DOWNREGULATED", "NS"),
                      name = "",
                     )+
  
  #marking the genes from the 'prot' variable
  geom_text_repel( data = deg_clean |> 
                     filter(gene_symbol %in% prot),
                  aes(label = gene_symbol, color = label),
                  box.padding = 0.5, 
                  point.padding = 0.3, 
                  max.overlaps = Inf,
                  color =  "black",
                  label.colour = "grey30",
                  fill = "white")+
  
  #adjusting the overall look
  scale_y_continuous(expand = c(0,0))+
  scale_x_continuous(limits = c(-10,10), breaks = seq(-10, 10, 2))+
  theme_classic()+
  labs(
    x = "Log2 Fold Change",
    y = "-log10 (P-value)"
  )+
  theme(axis.title = element_text(size = 20),
        axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        legend.text = element_text(size = 10),
        legend.position = "top")+
  
  #overrides the default size of the legends
  guides(color = guide_legend(override.aes = list(size = 5)))