Analyzing RNA-seq reads allows you to understand the expression of any gene of interest. Here, I outline the steps you can take to turn your mapped reads into a count dataframe or matrix for downstream analyses and data visualization. I will model this workflow on short-read and long-read data. The short-data used in this workflow is publicly available and located at: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA494179 .
Prior to this workflow, you should have mapped the your raw short-read sequences using an read-mapping or alignment tool like HISAT2 (http://daehwankimlab.github.io/hisat2/manual/) or STAR (https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf). From there you should have taken the alignment files (bam files) and quantified your reads using Salmon (https://salmon.readthedocs.io/en/latest/salmon.html).
For long-read data the workflow is slightly less complicated. You should be able to map your raw long-read data with minimap2 (https://lh3.github.io/minimap2/minimap2.html) and generate alignment files (bam files). These bam files can be used as input into FeatureCounts (https://subread.sourceforge.net/featureCounts.html). From, you should have a counts matrix that you can work with in R.
If you would like a more in-depth workflow -> https://www.bioconductor.org/help/course-materials/2019/CSAMA/materials/labs/lab-03-rnaseq/rnaseqGene_CSAMA2019.html#salmon-quantification
Prepare your working environment
Here, we are going to install and load the packages that we will use in this workflow.
Install packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.15")
BiocManager::install("tximport")## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'tximport'
BiocManager::install("readr")## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'readr'
BiocManager::install("magrittr")
BiocManager::install("DESeq2")
BiocManager::install("RColorBrewer")
BiocManager::install("tidyverse")## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'tidyverse'
BiocManager::install("GenomicFeatures", force = TRUE)
BiocManager::install("pheatmap", force = TRUE)
BiocManager::install("ashr")## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'ashr'
#Once you have installed the packages once, you won't need to do this again unless you wipe or clear your working environment
library(tximport)
library(DESeq2)
library(tidyverse)
library(ggpubr)
library(GenomicFeatures)
library(magrittr)
library(readr)
library(ashr)
library(RColorBrewer)
library(pheatmap)###Generate a raw counts and abundance dataframe from the short-read data###
Load in your genome annotation file
AT_TxDb <- makeTxDbFromGFF("/mnt/Kanta/rna_seq_tutorial/Arabidopsis_thaliana.TAIR10.53.mod.gff3", format = "gff")Get annotaion ready for tximport, here we are mapping transcript names to the gene. This will allow us to generate a gene-level counts dataframe
k <- keys(AT_TxDb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(AT_TxDb, k, "GENEID", "TXNAME")Store the direct path to your data (the place where you generated the quant directories) in an object
dir_maf4_ABA <- "/mnt/Kanta/rna_seq_tutorial/quant_output"Read in metadata
- metadata can be grabbed from the sample info located on NCBI. The metadata should map each quant directory to its respecitve treatment or condition
samplelist_ABA <- read_delim("/mnt/Kanta/rna_seq_tutorial/ABA_samplelist.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)## Rows: 20 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Run, Condition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#head(samplelist_ABA) #double check your sample listCreate an object that stores the path to each of your quant files
Qfiles_ABA <- file.path(dir_maf4_ABA, samplelist_ABA$Run, "quant.sf")
#head(Qfiles_ABA) #make sure the paths to these files exists and is correctLet’s get gene level TPM
-> If you are interested in doing a differential expression analysis, DESeq2 will always want counts and it will normalize from there. Best practice is to generate the tximport object containing your expression data and using that as input for DESeq2 -> If you are interested in z-scoring and visualizing relative abundance/expression of a transcript you will want to save the abundance matrix of the tximport object - Lets generate both files!
txi_ABA <- tximport(files = Qfiles_ABA, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## transcripts missing from tx2gene: 1770
## summarizing abundance
## summarizing counts
## summarizing length
#Add more informative experiment identifiers to our counts matrix
colnames(txi_ABA$counts) <- samplelist_ABA$Condition
#Lets save this counts matrix
write.csv(txi_ABA$counts,
file="/mnt/Kanta/rna_seq_tutorial/txi_ABA_counts.csv")
##We can do the same for our abundance matrix
colnames(txi_ABA$abundance) <- samplelist_ABA$Condition
write.csv(txi_ABA$abundance,
file="/mnt/Kanta/rna_seq_tutorial/txi_ABA_tpm.csv")###Conduct differential expression analysis###
dds_txi_ABA <- DESeqDataSetFromTximport(txi_ABA,
colData = samplelist_ABA,
design = ~Condition)## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
## using just counts from tximport
# Interestd in pre-filtering genes before you run the differential expression? You can keep all genes that have at least 10 counts or more
#keep <- rowSums(counts(dds_txi_ABA)) >= 10 # store the gene names that have at least 10 counts acorss the entire experiment
#dds_txi_ABA_filt <- ddsTxi_maf4_ABA[keep,] # subset the dataframe keeping only the genes that meet the count threshholdDo the differential expression
#maf4_ABA
dds_ABA <- DESeq(dds_txi_ABA)## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
For QC purposes
#Let's see how our samples cluster
vst_ABA <- vst(dds_ABA, blind = FALSE)
vst_ass <- assay(vst_ABA)
plotPCA(vst_ABA , intgroup = c("Condition"))#Looks like the majority of variance is due to time post-ABA treatment. The PCA only takes the top 10% variable genes. Let's see if we can generate a heatmap that visualizes the clustering of samples using all the genes. sampleDists <- dist(t(assay(vst_ABA)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vst_ABA$Condition) #assign informative rownames
colnames(sampleDistMatrix) <- paste(vst_ABA$Run) #assign informative column names
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(100)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col=colors)#There are a few samples of different timepoints that cluster together - this something to check on. But, let's continue for now#Introduce your comparisons of interest
#this is where you want to introduce the contrasts that you would like to make
res_ABA_2hr <- results(dds_ABA, contrast = c("Condition", "2hr_ABA", "0hr_ABA"))
#res_ABA_2hr
res_ABA_4hr <- results(dds_ABA, contrast = c("Condition", "4hr_ABA", "0hr_ABA"))
#res_ABA_4hr
res_ABA_6hr <- results(dds_ABA, contrast = c("Condition", "6hr_ABA", "0hr_ABA"))
#res_ABA_6hr
res_ABA_8hr <- results(dds_ABA, contrast = c("Condition", "8hr_ABA", "0hr_ABA"))
#res_ABA_8hr #Here we can visualize the most significanlty differentirally expressed gene at any specific time point
topGene <- rownames(res_ABA_2hr)[which.min(res_ABA_8hr$padj)] #grab the gene that has the smallest adjusted p-value for that time-point
plotCounts(dds_ABA, gene = topGene, intgroup = c("Condition")) #plot it
# Let’s generate an MA plot to visualize the total differential
expression
resultsNames(dds_ABA) #This will give you all the comparisons that you can look at ## [1] "Intercept" "Condition_2hr_ABA_vs_0hr_ABA"
## [3] "Condition_4hr_ABA_vs_0hr_ABA" "Condition_6hr_ABA_vs_0hr_ABA"
## [5] "Condition_8hr_ABA_vs_0hr_ABA"
res <- lfcShrink(dds_ABA, coef="Condition_8hr_ABA_vs_0hr_ABA", type="ashr") #normalize lfc at a specific timepoint## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-10,10)) #plot your MA plot to visualize the differentially expressed genes###Generate a raw counts and abundance dataframe from long-read data###
Read in your featureCounts table
at_ftc <- read.table("/mnt/Kanta/rna_seq_tutorial/at_drna_ftc.out", header=TRUE, row.names=1)
at_ftc <- at_ftc[ ,6:ncol(at_ftc)] #Only keep the columns containing raw counts for each gene (columns 6,7,8,9)
colnames(at_ftc) <- gsub("\\.[sb]am$", "", colnames(at_ftc)) #Clean up the column names
colnames(at_ftc) <- gsub("align_to_genome.", "", colnames(at_ftc))Generate dataframe for input into DESeq2
at_ftc <- as.matrix(at_ftc) #make this counts table into a matrix
condition <- factor(c(rep("ctl", 2), rep("stress", 2))) #save the conditions into an object
coldata <- data.frame(row.names=colnames(at_ftc), condition) # Create a coldata frame that maps each sample/column to conditions for DESeq2
dds_heat_mat <- DESeqDataSetFromMatrix(countData=at_ftc, colData=coldata, design=~condition) #generate the dataframe
# Interestd in pre-filtering genes before you run the differential expression? You can keep all genes that have at least 3 counts or more
#keep <- rowSums(counts(dds_heat_mat)) >= 3 # store the gene names that have at least 3 counts acorss the entire experiment
#dds_heat_mat_filt <- dds_heat_mat[keep,] # subset the dataframe keeping only the genes that meet the count threshholdRun the DESeq pipeline and save it for future use
dds_at <- DESeq(dds_heat_mat, parallel = T)## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates: 30 workers
## mean-dispersion relationship
## final dispersion estimates, fitting model and testing: 30 workers
vsd_at <- vst(dds_heat_mat, blind = F)
res_dds_at <- results(dds_at, contrast = c("condition", "stress", "ctl"))
write.csv(res_dds_at, file = "/mnt/Kanta/rna_seq_tutorial/res_dds_at.csv")-> Now, we can do some quality checking just like we did for short-read data
Generate a PCA plot
plotPCA(DESeqTransform(vsd_at))#Majority of our variance is explained by condition!#We can also generate a pheatmap that visualizes the sample alike-ness
sampleDists <- dist(t(assay(vsd_at)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd_at$condition)
colors <- colorRampPalette( rev(brewer.pal(9, "Reds")) )(100)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col=colors,
annotation_names_col = TRUE)#our samples cluster by conditions!#Visualize the most significanlty differentirally expressed gene at any specific time point
topGene <- rownames(res_dds_at)[which.min(res_dds_at$padj)] #grab the gene that has the smallest adjusted p-value for that time-point
plotCounts(dds_at, gene = topGene, intgroup = c("condition")) #plot it #MA plot to visualize the total differential expression
resultsNames(dds_at) #This will give you all the comparisons that you can look at ## [1] "Intercept" "condition_stress_vs_ctl"
res <- lfcShrink(dds_at, coef="condition_stress_vs_ctl", type="ashr") #normalize lfc at a specific timepoint## using 'ashr' for LFC shrinkage. If used in published research, please cite:
## Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
## https://doi.org/10.1093/biostatistics/kxw041
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-10,10)) #plot your MA plot to visualize the differentially expressed genes