Differential Gene Expression Analysis with Kallisto & DESeq2

Author

Chris Mantegna

Published

May 13, 2023

Packages & Libraries

Load the packages you’ll need by ‘calling’ them from your package library

Purpose

Using RNASeq data to compare gene expression is a standard workflow to help scientists clarify which genes are differently expressed than their ‘normal’ state when exposed to perturbations. We’re starting from the place where you’ve already QC’d your RNASeq files and are ready to jump into quantification and statistical analysis. If you’re interested in learning more about the the QC process and application, check out The Galaxy Project. If you already know but want to check out your options, look into FastQC or MultiQC.

Data

Existing RNA-Seq data was retrieved from the following complete RNASeq data available in the NCBI database:
- SRR19782039 Exposure to Valsartan & Carbamazepine
- SRR16771870 Exposure to a synthetic hormone 17 a-Ethinylestradiol (EE2)
- SRR7725722 Diarrhetic Shellfish Poisoning (DSP) toxins associated with Harmful Algal Blooms (HABs)
- SRR13013756 Hypoxia
- Mytilus galloprovincialis Reference Genome

Know where you are before you start

getwd()

Pulling my reference genome from prefab code provided by NCBI

curled genome and copied just cds file


 cd /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/data/ncbi_dataset/data/GCA_900618805.1
 head cds_from_genomic.fna

>lcl|UYJE01000001.1_cds_VDH88688.1_1 [locus_tag=MGAL_10B017214] [protein=Hypothetical predicted protein] [protein_id=VDH88688.1] [location=complement(join(55975..56224,59152..59365,60239..60337,61332..61522,64535..64608))] [gbkey=CDS]
ATGAATAGAATTACTGATAGGGACTACGACTACTATGACTTTGAAGATGACAGTGACCACGAGCCTTGCGATAGTTCTGA
TGATGATATCGAGGTTATTTTACATGGAACACCTGAACAGAAGCGTAAATTACAGACCAAAGTCCAACAAAGACATGATT
CTTCAAGTGAAGATGACTTTGAAAAGGAAATGAATAATGAACTTAACAAACATATTAAAGGACTGGTAAATGAAAGATCA
AGTAATGTTGCAGAAACTGTTCAAGGTAGTAGCAAAGCTCAAGACCAAGAGAAACCAACAGAACAACAACAATTTTATGA
TGATATTTATTACGATTCAGAAGAAGAGGAAATGGTTTTACAAGGTGATGAACGTGTCAAAAGAAGACAACCTGTTCAAA
GCAATGATGACTTATTGTACGATCCTGACCTAGACGAAGAAGACCAGCGATGGGTTGATGCTGAACGACAAGCTTATCAG
CTGCCTGTACCCTCAGGATCCAAATCAAAACGTCAAAACAGTGATGCAGTTTTAAACTGTCCCGCTTGTATGACATTACT
GTGTCTTGATTGTCAGGGGCATGATGTTTATGAAAACCAGTACAGAGCTATGTTTGTTAAGAACTGTCGTGTCGATACAT
CAGAATTATTAAAACAGCCGTTACAGAAGAAAAAACGTAAAAAAAAACAGAAGACATTGGACACTACAAATAATGAAACA

Create the index file to align my short read files to the genes from the MGAL_10

This file is to create an index. The index will be used to quantify the transcripts abundances. The index file maps the RNASeq reads to the transcriptome (or genome) to quantify expression level when we use Kallisto.

/home/shared/kallisto/kallisto \
index \
-i ../data/MGAL_cds.index \
../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna

head ../data/ncbi_dataset/data/GCA_900618805.1/cds_from_genomic.fna

Runnning Kallisto

/home/shared/kallisto/kallisto quant

kallisto 0.46.1
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --fusion                  Search for fusions for Pizzly
    --single                  Quantify single-end reads
    --single-overhang         Include reads where unobserved rest of fragment is
                              predicted to lie outside a transcript
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Save pseudoalignments to transcriptome to BAM file
    --genomebam               Project pseudoalignments to genome sorted BAM file
-g, --gtf                     GTF file for transcriptome information
                              (required for --genomebam)
-c, --chromosomes             Tab separated file with chromosome names and lengths
                              (optional for --genomebam, but recommended)

ls /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/

SRR13013756.fastq
SRR13013756_fastqc.html
SRR13013756_fastqc.zip
SRR16771870_1.fastq
SRR16771870_1_fastqc.html
SRR16771870_1_fastqc.zip
SRR16771870_2.fastq
SRR16771870_2_fastqc.html
SRR16771870_2_fastqc.zip
SRR19782039.fastq
SRR19782039_fastqc.html
SRR19782039_fastqc.zip
SRR7725722_1.fastq
SRR7725722_1_fastqc.html
SRR7725722_1_fastqc.zip
SRR7725722_2.fastq
SRR7725722_2_fastqc.html
SRR7725722_2_fastqc.zip

You will have to use the ‘mkdir’ bash command to create the output folder in Kalliso. The code is commented out so that it diens’t run, but it is a necessary step if this is your first time doing it.


#mkdir ../output
mkdir ../output/kallisto_01

#pulling all files for comparison to my index

find /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*_1.fastq \
| xargs basename -s _1.fastq  | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/MGAL_cds.index \
-o ../output/kallisto_01/{} \
-t 4 \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_1.fastq \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}_2.fastq \

Getting length mean & SD for the single read sequences in the .fastq files. Code taken from: https://www.biostars.org/p/243552/


awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq

Grabbing only the fastq files ## Here we are quantifying our RNASeq results with Kallisto. We do this by first searching for our fastq files, organizes what we find, cleans up the names of what we find, and then using the parameters set at the bottom of the code chunk we are creating individual outputs for each found file.


find /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/*.fastq \
| xargs basename -s .fastq  | xargs -I{} /home/shared/kallisto/kallisto \
quant -i ../data/MGAL_cds.index \
--single -l 92 -s 405 \
-o ../output/kallisto_01/{} \
-t 4 \
/home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/ncbi/{}.fastq

Double check where you’re working and where your data has landed.

getwd()

cd /home/shared/trinityrnaseq-v2.12.0

#Kallisto

This Bash code runs the ‘abundance_estimates’ script from the Trinity RNASeq package and creates a gene expression matrix from the output of Kallisto quantification results.


perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01 \
    --name_sample_by_basedir \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_1/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR7725722_2/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR13013756/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_1/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR16771870_2/abundance.tsv \
    /home/shared/8TB_HDD_02/cnmntgna/GitHub/chris-musselcon/output/kallisto_01/SRR19782039/abundance.tsv

We are creating a count matrix dataframe, a.k.a. a table of how many times a feature (i.e., gene) is observed in our sequence. Each row is a feature and each column is a sample. After we create the table we are going to take a look at it to make sure it looks like we want it. We’re doing this in standard R programming commands.

countmatrix <- read.delim("../output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
#tail(countmatrix)
#dim(countmatrix)

Since all of the values in our table are very large and burdensome to read, we are going to round the values to make it easier to get at what we really want. Once we do that, we will use str() to take a fast look at the results.

countmatrix <- round(countmatrix, 0)
str(countmatrix)

dim(countmatrix)

#dim(deseq2.colData)

length(colnames(data))

We have what we want and we are moving into using the DESeq package. First, we’re going to add the metadata to the count matrix we made a few steps back. Second, we’re going to put them together and create a new dataset that can be processed in the DESeq program.

deseq2.colData <- data.frame(condition=factor(c("DSP", "DSP", "Hypoxia", "EE2", "EE2", "EE2", "Valsartan")), 
                             type=factor(rep("single-read", 7)))
rownames(deseq2.colData) <- colnames(data)
dim(deseq2.colData)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)

Take a look

deseq2.dds <- DESeq(deseq2.dds)
deseq2.res <- results(deseq2.dds)
deseq2.res <- deseq2.res[order(rownames(deseq2.res)), ]

head(deseq2.dds)

vsd <- vst(deseq2.dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")

# Select top 50 differentially expressed genes - is it valuable to select more than 50?
res <- results(deseq2.dds)
res_ordered <- res[order(res$padj), ]
top_genes <- row.names(res_ordered)[1:50]

# Extract counts and normalize
counts <- counts(deseq2.dds, normalized = TRUE)
counts_top <- counts[top_genes, ]

# Log-transform counts
log_counts_top <- log2(counts_top + 1)

# Generate heatmap
pheatmap(log_counts_top, scale = "row")

head(deseq2.res)

# Count number of hits with adjusted p-value less then 0.05
dim(deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ])

tmp <- deseq2.res
# The main plot
plot(tmp$baseMean, tmp$log2FoldChange, pch=20, cex=0.45, ylim=c(-3, 3), log="x", col="darkgray",
     main="DEG Valsartan vs DSP (pval <= 0.05)",
     xlab="mean of normalized counts",
     ylab="Log2 Fold Change")
# Getting the significant points and plotting them again so they're a different color
tmp.sig <- deseq2.res[!is.na(deseq2.res$padj) & deseq2.res$padj <= 0.05, ]
points(tmp.sig$baseMean, tmp.sig$log2FoldChange, pch=20, cex=0.45, col="red")
# 2 FC lines
abline(h=c(-1,1), col="blue")

# Prepare the data for plotting
res_df <- as.data.frame(deseq2.res)
res_df$gene <- row.names(res_df)

# Create volcano plot
volcano_plot <- ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = padj < 0.05)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("grey", "red")) +
  labs(title = "Volcano Plot",
       x = "Log2 Fold Change",
       y = "-Log10 Adjusted P-value",
       color = "Significantly\nDifferentially Expressed") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "top")

print(volcano_plot)

write.table(tmp.sig, "../output/DEGlist.tab", sep = '\t', row.names = T)

deglist <- read.csv("../output/DEGlist.tab", sep = '\t', header = TRUE)
deglist$RowName <- rownames(deglist)
deglist2 <- deglist[, c("RowName", "pvalue")] # Optionally, reorder the columns

datatable(deglist)

#Annotation ## BLAST to uniprot, use HISAT? ### Enrichr install/ call

install.packages("devtools")
library(devtools)
install_github("wjawaid/enrichR")
library("enrichR")
listEnrichrSites()
dbs <- listEnrichrDbs()
head(dbs)