Analyzing RNA-seq reads allows you to understand the expression of any gene of interest. Here, I outline the steps you can take to turn your mapped reads into a count dataframe or matrix for downstream analyses and data visualization. I will model this workflow on short-read and long-read data. The short-data used in this workflow is publicly available and located at: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA494179 .

Prior to this workflow, you should have mapped the your raw short-read sequences using an read-mapping or alignment tool like HISAT2 (http://daehwankimlab.github.io/hisat2/manual/) or STAR (https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). From there you should have taken the alignment files (bam files) and quantified your reads using Salmon (https://salmon.readthedocs.io/en/latest/salmon.html).

For long-read data the workflow is slightly less complicated. You should be able to map your raw long-read data with minimap2 (https://lh3.github.io/minimap2/minimap2.html) and generate alignment files (bam files). These bam files can be used as input into FeatureCounts (https://subread.sourceforge.net/featureCounts.html). From, you should have a counts matrix that you can work with in R.

If you would like a more in-depth workflow -> https://www.bioconductor.org/help/course-materials/2019/CSAMA/materials/labs/lab-03-rnaseq/rnaseqGene_CSAMA2019.html#salmon-quantification

Prepare your working environment

Here, we are going to install and load the packages that we will use in this workflow.

Install packages

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.15")

BiocManager::install("tximport")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'tximport'
BiocManager::install("readr")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'readr'
BiocManager::install("magrittr")
BiocManager::install("DESeq2")
BiocManager::install("RColorBrewer")
BiocManager::install("tidyverse")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'tidyverse'
BiocManager::install("GenomicFeatures", force = TRUE)
BiocManager::install("pheatmap", force = TRUE)
BiocManager::install("ashr")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'ashr'
#Once you have installed the packages once, you won't need to do this again unless you wipe or clear your working environment
library(tximport)
library(DESeq2)
library(tidyverse)
library(ggpubr)
library(GenomicFeatures)
library(magrittr)
library(readr)
library(ashr)
library(RColorBrewer)
library(pheatmap)

###Generate a raw counts and abundance dataframe from the short-read data###

Load in your genome annotation file

AT_TxDb <- makeTxDbFromGFF("/mnt/Kanta/rna_seq_tutorial/Arabidopsis_thaliana.TAIR10.53.mod.gff3", format = "gff")

Get annotaion ready for tximport, here we are mapping transcript names to the gene. This will allow us to generate a gene-level counts dataframe

k <- keys(AT_TxDb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(AT_TxDb, k, "GENEID", "TXNAME")

Store the direct path to your data (the place where you generated the quant directories) in an object

dir_maf4_ABA <- "/mnt/Kanta/rna_seq_tutorial/quant_output"

Read in metadata

  • metadata can be grabbed from the sample info located on NCBI. The metadata should map each quant directory to its respecitve treatment or condition
samplelist_ABA <- read_delim("/mnt/Kanta/rna_seq_tutorial/ABA_samplelist.txt", 
    delim = "\t", escape_double = FALSE, 
    trim_ws = TRUE)
## Rows: 20 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Run, Condition
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#head(samplelist_ABA) #double check your sample list

Create an object that stores the path to each of your quant files

Qfiles_ABA <- file.path(dir_maf4_ABA, samplelist_ABA$Run, "quant.sf")

#head(Qfiles_ABA) #make sure the paths to these files exists and is correct

Let’s get gene level TPM

-> If you are interested in doing a differential expression analysis, DESeq2 will always want counts and it will normalize from there. Best practice is to generate the tximport object containing your expression data and using that as input for DESeq2 -> If you are interested in z-scoring and visualizing relative abundance/expression of a transcript you will want to save the abundance matrix of the tximport object - Lets generate both files!

txi_ABA <- tximport(files = Qfiles_ABA, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
## transcripts missing from tx2gene: 1770
## summarizing abundance
## summarizing counts
## summarizing length
#Add more informative experiment identifiers to our counts matrix
colnames(txi_ABA$counts) <- samplelist_ABA$Condition

#Lets save this counts matrix 
write.csv(txi_ABA$counts, 
          file="/mnt/Kanta/rna_seq_tutorial/txi_ABA_counts.csv")

##We can do the same for our abundance matrix
colnames(txi_ABA$abundance) <- samplelist_ABA$Condition
write.csv(txi_ABA$abundance, 
          file="/mnt/Kanta/rna_seq_tutorial/txi_ABA_tpm.csv")

###Conduct differential expression analysis###

dds_txi_ABA <- DESeqDataSetFromTximport(txi_ABA,
                                   colData = samplelist_ABA,
                                   design = ~Condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using just counts from tximport
# Interestd in pre-filtering genes before you run the differential expression? You can keep all genes that have at least 10 counts or more 
#keep <- rowSums(counts(dds_txi_ABA)) >= 10 # store the gene names that have at least 10 counts acorss the entire experiment
#dds_txi_ABA_filt <- ddsTxi_maf4_ABA[keep,] # subset the dataframe keeping only the genes that meet the count threshhold

Do the differential expression

#maf4_ABA
dds_ABA <- DESeq(dds_txi_ABA)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

For QC purposes

#Let's see how our samples cluster

vst_ABA <- vst(dds_ABA, blind = FALSE)
vst_ass <- assay(vst_ABA)
plotPCA(vst_ABA , intgroup = c("Condition"))

#Looks like the majority of variance is due to time post-ABA treatment. The PCA only takes the top 10% variable genes. Let's see if we can generate a heatmap that visualizes the clustering of samples using all the genes. 
sampleDists <- dist(t(assay(vst_ABA)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vst_ABA$Condition) #assign informative rownames
colnames(sampleDistMatrix) <- paste(vst_ABA$Run) #assign informative column names
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(100)
pheatmap(sampleDistMatrix,
          clustering_distance_rows = sampleDists,
          clustering_distance_cols = sampleDists,
          col=colors)

#There are a few samples of different timepoints that cluster together - this something to check on. But, let's continue for now

#Introduce your comparisons of interest

#this is where you want to introduce the contrasts that you would like to make
res_ABA_2hr <- results(dds_ABA, contrast = c("Condition", "2hr_ABA", "0hr_ABA"))
#res_ABA_2hr 

res_ABA_4hr <- results(dds_ABA, contrast = c("Condition", "4hr_ABA", "0hr_ABA"))
#res_ABA_4hr 

res_ABA_6hr <- results(dds_ABA, contrast = c("Condition", "6hr_ABA", "0hr_ABA"))
#res_ABA_6hr 

res_ABA_8hr <- results(dds_ABA, contrast = c("Condition", "8hr_ABA", "0hr_ABA"))
#res_ABA_8hr 

#Here we can visualize the most significanlty differentirally expressed gene at any specific time point

topGene <- rownames(res_ABA_2hr)[which.min(res_ABA_8hr$padj)] #grab the gene that has the smallest adjusted p-value for that time-point
plotCounts(dds_ABA, gene = topGene, intgroup = c("Condition")) #plot it 

# Let’s generate an MA plot to visualize the total differential expression

resultsNames(dds_ABA) #This will give you all the comparisons that you can look at 
## [1] "Intercept"                    "Condition_2hr_ABA_vs_0hr_ABA"
## [3] "Condition_4hr_ABA_vs_0hr_ABA" "Condition_6hr_ABA_vs_0hr_ABA"
## [5] "Condition_8hr_ABA_vs_0hr_ABA"
res <- lfcShrink(dds_ABA, coef="Condition_8hr_ABA_vs_0hr_ABA", type="ashr") #normalize lfc at a specific timepoint
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-10,10)) #plot your MA plot to visualize the differentially expressed genes

###Generate a raw counts and abundance dataframe from long-read data###

Read in your featureCounts table

at_ftc <- read.table("/mnt/Kanta/rna_seq_tutorial/at_drna_ftc.out", header=TRUE, row.names=1) 
at_ftc <- at_ftc[ ,6:ncol(at_ftc)] #Only keep the columns containing raw counts for each gene (columns 6,7,8,9)
colnames(at_ftc) <- gsub("\\.[sb]am$", "", colnames(at_ftc)) #Clean up the column names
colnames(at_ftc) <- gsub("align_to_genome.", "", colnames(at_ftc))

Generate dataframe for input into DESeq2

at_ftc <- as.matrix(at_ftc) #make this counts table into a matrix
condition <- factor(c(rep("ctl", 2), rep("stress", 2))) #save the conditions into an object 

coldata <- data.frame(row.names=colnames(at_ftc), condition) # Create a coldata frame that maps each sample/column to conditions for DESeq2

dds_heat_mat <- DESeqDataSetFromMatrix(countData=at_ftc, colData=coldata, design=~condition) #generate the dataframe


# Interestd in pre-filtering genes before you run the differential expression? You can keep all genes that have at least 3 counts or more 
#keep <- rowSums(counts(dds_heat_mat)) >= 3 # store the gene names that have at least 3 counts acorss the entire experiment
#dds_heat_mat_filt <- dds_heat_mat[keep,] # subset the dataframe keeping only the genes that meet the count threshhold

Run the DESeq pipeline and save it for future use

dds_at <- DESeq(dds_heat_mat, parallel = T)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates: 30 workers
## mean-dispersion relationship
## final dispersion estimates, fitting model and testing: 30 workers
vsd_at <- vst(dds_heat_mat, blind = F) 

res_dds_at <- results(dds_at, contrast = c("condition", "stress", "ctl")) 

write.csv(res_dds_at, file = "/mnt/Kanta/rna_seq_tutorial/res_dds_at.csv")

-> Now, we can do some quality checking just like we did for short-read data

Generate a PCA plot

plotPCA(DESeqTransform(vsd_at))

#Majority of our variance is explained by condition!

#We can also generate a pheatmap that visualizes the sample alike-ness

sampleDists <- dist(t(assay(vsd_at)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd_at$condition)
colors <- colorRampPalette( rev(brewer.pal(9, "Reds")) )(100)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col=colors,
         annotation_names_col = TRUE)

#our samples cluster by conditions!

#Visualize the most significanlty differentirally expressed gene at any specific time point

topGene <- rownames(res_dds_at)[which.min(res_dds_at$padj)] #grab the gene that has the smallest adjusted p-value for that time-point
plotCounts(dds_at, gene = topGene, intgroup = c("condition")) #plot it 

#MA plot to visualize the total differential expression

resultsNames(dds_at) #This will give you all the comparisons that you can look at 
## [1] "Intercept"               "condition_stress_vs_ctl"
res <- lfcShrink(dds_at, coef="condition_stress_vs_ctl", type="ashr") #normalize lfc at a specific timepoint
## using 'ashr' for LFC shrinkage. If used in published research, please cite:
##     Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
##     https://doi.org/10.1093/biostatistics/kxw041
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-10,10)) #plot your MA plot to visualize the differentially expressed genes

You just finished a differential expression analysis!