The generation of large amounts of transcriptome data by RNA sequencing requires analysis pipelines that are efficient and reproducible. The goal of this tutorial is to guide you through the main steps of a pipeline developed for RNA-seq analysis, containing the tools necessary for this task. The pipeline is divided into four steps: data pre-processing, differential expression analysis, network construction, and functional annotation. Several packages are used to accomplish the job: FastQC (Andrews et al., accessed 2023-07-10), Trimmomatic (Bolger, Lohse, and Usadel 2014), HISAT2 (Zhang et al. 2021), StringTie (Pertea et al. 2015), DESeq2 (Love, Huber, and Anders 2014), and WGCNA (Langfelder and Horvath 2008). The pre-processing includes quality assessment (FastQC), trimming of unwanted low-quality regions and sequencing adapters from the raw sequencing data (Trimmomatic), mapping against the reference genome for transcript identification (HISAT2), and transcript counting (StringTie). For differential expression analysis, DESeq2 uses a negative binomial statistical model to model RNA-seq read counts. For network analysis, WGCNA measures the correlation between genes and constructs modules (sets of strongly interconnected genes). With three main R packages, it is possible to assign functional annotations to these genes, including gene ontology and pathway enrichment. The pipeline was tested with data from studies that provided RNA-seq raw files and final lists of differentially expressed genes. The RNA-seq data were downloaded from dbGaP (phs000699.v1.p1), from a published study that characterized the gene expression profiles of osteosarcomas (Perry et al. 2014).
First of all, we need a terminal with access to plenty of RAM and hard drive space. Go to your local machine and create a new directory where you will conduct your analysis. You can use the following commands:
[bash terminal]
mkdir directory_path
cd directory_path
Make sure you have Docker and Git installed on your machine. Try typing docker in your terminal and, later, git. If usage messages appear, you are ready to proceed. If not, you will need to install the packages. There are plenty of good tutorials on the internet to help you with this installation.
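As a quick check, you can also ask each tool for its version; if both commands print a version string, the installation is in place:
[bash terminal]
docker --version
git --version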
Next, you need to clone our repo. If you are not familiar with our GitHub page, check us out: https://github.com/meidanis-lab/trans-pipeline. Click on the green Code button, copy the HTTPS link, and paste it into your terminal, like this:
[bash terminal]
git clone https://github.com/meidanis-lab/trans-pipeline
If you cloned our repository correctly, you will see 15 directories. Their names start with 3-digit numbers that roughly mimic the processing order. Here is a brief summary of what each directory is meant to do:
from 005_preqc to 045_prepde, the data will be pre-processed;
from 100_counts to 145_deg, the transcript count matrix will go through differential expression analysis;
in the 200_wgcna directory, network analysis will be performed.
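To confirm that the clone worked, you can enter the repository and list the numbered directories (a minimal check, assuming you cloned into your current working directory):
[bash terminal]
cd trans-pipeline
ls -d [0-9]*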
You will need to produce a suitable Docker image using our Dockerfile, which resides in the docs directory. This file installs all the dependencies necessary to run the pipeline. Follow the next step to build the image, called trans. This step will take a couple of hours, because all the needed packages will be installed.
[bash terminal]
docker build -t trans docs/
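Once the build finishes, you can confirm that the image is available locally:
[bash terminal]
docker image ls trans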
This part is responsible for pre-processing your sequencing data so you can perform the differential expression and network analyses, or any other analysis where count matrices are needed. At the end, two matrices will be generated: one with transcript counts and the other with gene counts. As this tutorial is for the analysis of RNA-seq samples, we are going to focus on the transcripts.
For this part, you will need to make a new directory named
000_input, with a subdirectory called fastq,
like this:
[bash terminal]
mkdir 000_input
mkdir 000_input/fastq
Place all your raw fastq files in the subdirectory
000_input/fastq.
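For example, if your raw files live somewhere else on the machine, you can copy them over; the source path below is just a placeholder:
[bash terminal]
cp /path/to/your/raw/*.fastq.gz 000_input/fastq/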
You will also need to place your genome index (gen.index) and your reference genome (ref.genome) inside 000_input. If you already have them on your machine, just move them, as follows:
[bash terminal]
mv gen.index 000_input/
mv ref.genome 000_input/
The index files can be obtained from the HISAT2 webpage. There are a number of pre-built indexes there that you can simply download and place in the directory indicated above. We recommend downloading one built with transcripts and SNPs.
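As a rough sketch, downloading and unpacking a pre-built index could look like the commands below. The URL, archive name, and extracted directory name are assumptions, so check the HISAT2 downloads page for the current link and adjust accordingly:
[bash terminal]
# URL and file names below are hypothetical; verify them on the HISAT2 downloads page
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz
tar -xzf grch38_snptran.tar.gz
mv grch38_snp_tran 000_input/gen.index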
The reference genome can be found in the following link:
<ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz>.
We will be using version 38 of the human genome here. To get the
reference genome, just do:
[bash terminal]
wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
gunzip Homo_sapiens.GRCh38.84.gtf.gz
mv Homo_sapiens.GRCh38.84.gtf 000_input/ref.genome
With the input data already in place in the right directories, the next part is easy, but it is going to take a long time. Two paired-end fastq files of 6.6 GB require about 3 h of processing time on our machine. The exact time for you will depend on the number and size of your samples, and on your processing power as well. It may be a good idea to run this process as a background job, using a nohup command.
Simply type the following command in your bash terminal; the bash script named pre-proc.sh will perform all the steps comprising the pre-processing part. Hours later, you will have your transcript count matrix, named counts.csv, in the 045_prepde directory.
[bash terminal]
nohup ./pre-proc.sh > nohup_proc.out 2>&1 &
You can check the output messages in the nohup_proc.out file generated by the analysis.
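While the job runs, you can follow its progress with tail:
[bash terminal]
tail -f nohup_proc.out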
This script executes the following steps:
Quality control: FastQC will perform a quality test and generate a report for each sample. You can download these reports to analyse them at any time.
Trimming of samples: Trimmomatic will trim bases with a Phred score below 33, remove adapters, and discard reads shorter than 36 base pairs.
Identification of transcripts: HISAT2 will identify the transcripts present in the samples by aligning the reads against the genome index. Because of this, make sure your genome index includes transcripts and SNPs.
Quantification of transcripts: StringTie will quantify how many reads exist for each transcript in each sample, also by comparing them with the reference genome; a rough sketch of these steps is given after this list.
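For reference, the commands run by pre-proc.sh for a single paired-end sample look roughly like the sketch below. The file names, index prefix, adapter file, and Trimmomatic settings here are illustrative assumptions; the script in the repository is the authoritative version.
[bash terminal]
# Illustrative sketch only; paths, prefixes, and parameters are assumptions.
R1=000_input/fastq/sample_R1.fastq.gz
R2=000_input/fastq/sample_R2.fastq.gz

# 1. Quality control
fastqc "$R1" "$R2" -o 005_preqc/

# 2. Trimming (adapter file and thresholds are placeholders)
trimmomatic PE "$R1" "$R2" \
    sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
    sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:33 TRAILING:33 MINLEN:36

# 3. Alignment against the genome index (the index prefix is a placeholder)
hisat2 -x 000_input/gen.index/genome_snp_tran \
    -1 sample_R1.paired.fq.gz -2 sample_R2.paired.fq.gz \
    -S sample.sam
samtools sort -o sample.sorted.bam sample.sam

# 4. Transcript quantification against the reference annotation
stringtie sample.sorted.bam -e -G 000_input/ref.genome \
    -o sample.gtf -A sample_abund.tab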
For the differential expression analysis, DESeq2 will be used.
You will need to place your metadata in the 105_meta directory. The metadata is a file in .csv format that contains information about comparisons between conditions. Its columns contain the conditions and its rows are the samples. The samples need to be in the same order as in the counts file.
Also, you will have to create a contrasts.txt file, which contains the names of the columns of your metadata that you want to analyse. It is a comma-separated file in which each row has the name of a column, followed by the control condition and the variable condition. For example, absence of metastasis could be the control condition, and metastasis the variable condition. A sketch of both files is given after the commands below.
[bash terminal]
mkdir 105_meta
nano 105_meta/contrasts.txt
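To make the expected layout concrete, here is a hedged sketch of what the two files could contain. The metadata file name, the sample IDs, and the column names are assumptions for illustration only; adapt them to your own study.
[bash terminal]
# hypothetical metadata (rows = samples, in the same order as in the counts file)
cat > 105_meta/metadata.csv <<'EOF'
sample,metastasis_status,age_group
sample_01,absence_of_metastasis,young
sample_02,metastasis,young
sample_03,absence_of_metastasis,old
sample_04,metastasis,old
EOF

# hypothetical contrasts: column name, control condition, variable condition
cat > 105_meta/contrasts.txt <<'EOF'
metastasis_status,absence_of_metastasis,metastasis
EOF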
With all that set up, just run the de-analysis.sh script as follows:
[bash terminal]
nohup ./de-analysis.sh > nohup_de.out 2>&1 &
The results from the differential expression analysis, such as how many transcripts are up- or down-regulated for each comparison, will be in the generated nohup_de.out file.
For network analysis, WGCNA will be used. This package performs a correlation analysis between genes and the metadata. If genes have similar expression profiles, they will be grouped into the same module. This is interesting because genes with the same expression profile may be co-regulated, so we can infer that they belong to the same pathway.
In this step, WGCNA will measure the correlation between genes, separate them into modules, and later correlate the modules with the metadata. We encourage you to explore the files generated by this analysis. Visit the WGCNA site for further insights.
To run the network.sh script, you will need to write another file with the metadata you want to normalize your counts with. Just think of some metadata that makes sense to compare against, such as status, age, or years.
[bash terminal]
nano 200_wgcna/design.txt
Just place the name of the column of interest in this file, like this:
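A minimal sketch, assuming the column of interest is the hypothetical metastasis_status column from the metadata example above:
[bash terminal]
echo "metastasis_status" > 200_wgcna/design.txt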
Now, we can run the script. It is not necessary to run it with
nohup, because it is a fast analysis, but if you feel like
saving the messages that the script generates, you may run it as the
second line below suggests:
[bash terminal]
./network.sh
nohup ./network.sh > nohup_net.out 2>&1 &
This analysis will generate:
TOM files: contain the topological overlap matrix of the genes;
Edges files: contain the correlation values for the connections between genes;
Nodes files: contain the gene names that belong to the edges files;
Annotation files: contain the first step of annotation, such as gene names for the corresponding transcript IDs.
The edges and nodes files are in a format that can be used as input to Gephi (Bastian, Heymann, and Jacomy 2009). In this program you can build your own network.
All the graphs generated during the analysis are saved in the local directory (200_wgcna) and can be visualized as you please. The names of the graphs and their specifications are described below:
soft_power.png: two graphs are generated, the first indicating the scale independence (signed R²) and the second the mean connectivity. The soft-thresholding power beta is chosen where the red line crosses the lowest mean connectivity and the highest R². This soft-thresholding power beta corresponds to the power to which correlation values are raised so that the similarity matrix is converted into an adjacency matrix with an approximate scale-free topology.
outliers.png: it shows the clustering of samples by their Euclidean distance. It’s possible to detect outliers using this graph.
dendogram.png: the sample dendrogram and trait heatmap help understand each sample's profile based on the metadata. It is a hierarchical grouping of the samples and their expression across the different clinical data.
heatmap.png: this heatmap corresponds to the module trait relationship between the module eigengenes and the metadata. Each row corresponds to a module eigengene, column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend.
kegg.png: pathway enrichment analysis classification for the differentially expressed genes from the modules. Only the 10 most significant classifications are shown.
out_GO.png: Gene Ontology classification for the differentially expressed genes from the modules. Only the 10 most significant classifications are shown.
Functional annotation is also performed in this part. We select only the ten most significant annotations from the gene ontology and pathway enrichment analyses. The graphs generated from them can be downloaded to your machine and then analysed; they correspond to the kegg.png and out_GO.png files described above.