The generation of large amounts of transcriptome data by RNA sequencing requires analysis pipelines that are efficient and reproducible. The goal of this tutorial is to guide you through the main steps of a pipeline developed for RNA-seq analysis, containing the tools necessary for this task. The pipeline is divided into four steps: data pre-processing, differential expression analysis, network construction, and functional annotation. Several packages are used to accomplish the job: FastQC (Andrews et al., accessed 2023-07-10), Trimmomatic (Bolger, Lohse, and Usadel 2014), HISAT2 (Zhang et al. 2021), StringTie (Pertea et al. 2015), DESeq2 (Love, Huber, and Anders 2014), and WGCNA (Langfelder and Horvath 2008). The pre-processing includes quality assessment (FastQC), trimming of unwanted low-quality regions and sequencing adapters from the raw sequencing data (Trimmomatic), mapping against the reference genome for transcript identification (HISAT2), and transcript counting (StringTie). For differential expression analysis, DESeq2 uses a negative binomial statistical model to model RNA-seq read counts. For network analysis, WGCNA measures the correlation between genes and constructs modules (sets of strongly interconnected genes). With three main R packages, it is possible to assign functional annotations to these genes, including gene ontology and pathway enrichment. The pipeline was tested with data from studies that provided RNA-seq raw files and final lists of differentially expressed genes. The RNA-seq data were downloaded from dbGaP (phs000699.v1.p1), from a published study that characterized the gene expression profiles of osteosarcomas (Perry et al. 2014).
First of all, we need a terminal with access to plenty of RAM and hard drive space. Go to your local machine and create a new directory where you will conduct your analysis. You can use the following commands:
[bash terminal]
mkdir directory_path
cd directory_path
Make sure you have Docker and Git installed on your machine. Try typing docker in your terminal and, later, git. If usage messages appear, you are ready to proceed. If not, you will need to install the packages. There are plenty of good tutorials on the internet to help you with this installation.
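As a quick check, you can also ask each tool for its version; if both commands print a version string, the installation is in place:
[bash terminal]
docker --version
git --version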
Next, you need to clone our repo. If you are not familiar with our GitHub page, check us out: https://github.com/meidanis-lab/trans-pipeline. Click on the green Code button, copy the HTTPS link, and paste it into your terminal, like this:
[bash terminal]
git clone https://github.com/meidanis-lab/trans-pipeline
If you cloned our repository correctly, you will see 15 directories. Their names start with 3-digit numbers that roughly mimic the processing order. Here is a brief summary of what each directory is meant to do:
from 005_preqc to 045_prepde, the data will be pre-processed;
from 100_counts to 145_deg, the transcript count matrix will go through differential expression analysis;
in the 200_wgcna directory, network analysis will be performed.
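To confirm that the clone worked, you can enter the repository and list the numbered directories (a minimal check, assuming you cloned into your current working directory):
[bash terminal]
cd trans-pipeline
ls -d [0-9]*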
You will need to produce a suitable Docker image using our Dockerfile, which resides in the docs directory. This file installs all the dependencies necessary to run the pipeline. Follow the next step to build the image, called trans. This step will take a couple of hours, because all the needed packages will be installed.
[bash terminal]
docker build -t trans docs/
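Once the build finishes, you can confirm that the image is available locally:
[bash terminal]
docker image ls trans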
This part is responsible for pre-processing your sequencing data so you can perform the differential expression and network analyses, or any other analysis where count matrices are needed. At the end, two matrices will be generated: one with transcript counts and the other with gene counts. As this tutorial is for the analysis of RNA-seq samples, we are going to focus on the transcripts.
For this part, you will need to make a new directory named
000_input, with a subdirectory called fastq,
like this:
[bash terminal]
mkdir 000_input
mkdir 000_input/fastq
Place all your raw fastq files in the subdirectory
000_input/fastq.
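For example, if your raw files live somewhere else on the machine, you can copy them over; the source path below is just a placeholder:
[bash terminal]
cp /path/to/your/raw/*.fastq.gz 000_input/fastq/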
You will also need to place your genome index (gen.index) and your reference genome (ref.genome) inside 000_input. If you already have them on your machine, just move them, as follows:
[bash terminal]
mv gen.index 000_input/
mv ref.genome 000_input/
The index files can be obtained from the HISAT2 webpage. There are a number of pre-built indexes there that you can simply download and place in the directory indicated above. We recommend downloading one built with transcripts and SNPs.
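As a rough sketch, downloading and unpacking a pre-built index could look like the commands below. The URL, archive name, and extracted directory name are assumptions, so check the HISAT2 downloads page for the current link and adjust accordingly:
[bash terminal]
# URL and file names below are hypothetical; verify them on the HISAT2 downloads page
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz
tar -xzf grch38_snptran.tar.gz
mv grch38_snp_tran 000_input/gen.index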
The reference genome can be found in the following link:
<ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz>.
We will be using version 38 of the human genome here. To get the
reference genome, just do:
[bash terminal]
wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
gunzip Homo_sapiens.GRCh38.84.gtf.gz
mv Homo_sapiens.GRCh38.84.gtf 000_input/ref.genome
With the input data already in place in the right directories, the next part is easy, but it is going to take a long time. Two paired-end fastq files of 6.6 GB require about 3 h of processing time on our machine. The exact time for you will depend on the number and size of your samples, and on your processing power as well. It may be a good idea to run this process as a background job, using a nohup command.
Simply type the following command in your bash terminal; the bash script named pre-proc.sh will perform all the steps comprising the pre-processing part. Hours later, you will have your transcript count matrix, named counts.csv, in the 045_prepde directory.
[bash terminal]
nohup ./pre-proc.sh > nohup_proc.out 2>&1 &
You can check the output messages in the nohup_proc.out file generated by the analysis.
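While the job runs, you can follow its progress with tail:
[bash terminal]
tail -f nohup_proc.out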
This script executes the following steps:
Quality control: FastQC will perform a quality test and generate a report for each sample. You can download these reports to analyse them at any time.
Trimming of samples: Trimmomatic will trim bases with a Phred score below 33, remove adapters, and discard reads shorter than 36 base pairs.
Identification of transcripts: HISAT2 will identify the transcripts present in the samples by aligning the reads against the genome index. Because of this, make sure your genome index includes transcripts and SNPs.
Quantification of transcripts: StringTie will quantify how many reads exist for each transcript in each sample, also by comparing them with the reference genome; a rough sketch of these steps is given after this list.
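For reference, the commands run by pre-proc.sh for a single paired-end sample look roughly like the sketch below. The file names, index prefix, adapter file, and Trimmomatic settings here are illustrative assumptions; the script in the repository is the authoritative version.
[bash terminal]
# Illustrative sketch only; paths, prefixes, and parameters are assumptions.
R1=000_input/fastq/sample_R1.fastq.gz
R2=000_input/fastq/sample_R2.fastq.gz

# 1. Quality control
fastqc "$R1" "$R2" -o 005_preqc/

# 2. Trimming (adapter file and thresholds are placeholders)
trimmomatic PE "$R1" "$R2" \
    sample_R1.paired.fq.gz sample_R1.unpaired.fq.gz \
    sample_R2.paired.fq.gz sample_R2.unpaired.fq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:33 TRAILING:33 MINLEN:36

# 3. Alignment against the genome index (the index prefix is a placeholder)
hisat2 -x 000_input/gen.index/genome_snp_tran \
    -1 sample_R1.paired.fq.gz -2 sample_R2.paired.fq.gz \
    -S sample.sam
samtools sort -o sample.sorted.bam sample.sam

# 4. Transcript quantification against the reference annotation
stringtie sample.sorted.bam -e -G 000_input/ref.genome \
    -o sample.gtf -A sample_abund.tab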
For the differential expression analysis, DESeq2 will be used.
You will need to place your metadata in the 105_meta directory. The metadata is a file in .csv format that contains information about comparisons between conditions. Its columns contain the conditions and its rows are the samples. The samples need to be in the same order as in the counts file.
Also, you will have to create a contrasts.txt file, which contains the names of the columns of your metadata that you want to analyse. It is a comma-separated file in which each row has the name of a column, followed by the control condition and the variable condition. For example, absence of metastasis could be the control condition, and metastasis the variable condition. A sketch of both files is given after the commands below.
[bash terminal]
mkdir 105_meta
nano 105_meta/contrasts.txt
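To make the expected layout concrete, here is a hedged sketch of what the two files could contain. The metadata file name, the sample IDs, and the column names are assumptions for illustration only; adapt them to your own study.
[bash terminal]
# hypothetical metadata (rows = samples, in the same order as in the counts file)
cat > 105_meta/metadata.csv <<'EOF'
sample,metastasis_status,age_group
sample_01,absence_of_metastasis,young
sample_02,metastasis,young
sample_03,absence_of_metastasis,old
sample_04,metastasis,old
EOF

# hypothetical contrasts: column name, control condition, variable condition
cat > 105_meta/contrasts.txt <<'EOF'
metastasis_status,absence_of_metastasis,metastasis
EOF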
With all that set up, just run the de-analysis.sh script as follows:
[bash terminal]
nohup ./de-analysis.sh > nohup_de.out 2>&1 &
The results from the differential expression analysis, such as how many transcripts are up- or down-regulated for each comparison, will be in the generated nohup_de.out file.
For network analysis, WGCNA will be used. This package performs a correlation analysis between genes and the metadata. If genes have similar expression profiles, they will be grouped into the same module. This is interesting because genes with the same expression profile may be co-regulated, so we can infer that they belong to the same pathway.
In this step, WGCNA will measure the correlation between genes, separate them into modules, and later correlate the modules with the metadata. We encourage you to explore the files generated by this analysis. Visit the WGCNA site for further insights.
To run the network.sh script, you will need to write another file with the metadata you want to normalize your counts with. Just think of some metadata that makes sense to compare against, such as status, age, or years.
[bash terminal]
nano 200_wgcna/design.txt
Just place the name of the column of interest in this file, like this:
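A minimal sketch, assuming the column of interest is the hypothetical metastasis_status column from the metadata example above:
[bash terminal]
echo "metastasis_status" > 200_wgcna/design.txt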
Now, we can run the script. It is not necessary to run it with
nohup, because it is a fast analysis, but if you feel like
saving the messages that the script generates, you may run it as the
second line below suggests:
[bash terminal]
./network.sh
nohup ./network.sh > nohup_net.out 2>&1 &
This analysis will generate:
TOM files: contain the topological overlap matrix of the genes;
Edges files: contain the correlation values for the connections between genes;
Nodes files: contain the gene names that belong to the edges files;
Annotation files: contain the first step of annotation, such as gene names for the corresponding transcript IDs.
The edges and nodes files are in a format that can be used as input to Gephi (Bastian, Heymann, and Jacomy 2009). In this program you can build your own network.
All the graphs generated during the analysis are saved in the local directory (200_wgcna) and can be visualized as you please. The names of the graphs and their specifications are described below:
soft_power.png: two graphs are generated, the first indicating the scale independence (signed R²) and the second the mean connectivity. The soft-thresholding power beta is chosen where the red line crosses the lowest mean connectivity and the highest R². This soft-thresholding power beta corresponds to the power to which correlation values are raised so that the similarity matrix is converted into an adjacency matrix with an approximate scale-free topology.
outliers.png: it shows the clustering of samples by their Euclidean distance. It’s possible to detect outliers using this graph.
dendogram.png: the sample dendrogram and trait heatmap help understand each sample's profile based on the metadata. It is a hierarchical grouping of the samples and their expression across the different clinical data.
heatmap.png: this heatmap corresponds to the module trait relationship between the module eigengenes and the metadata. Each row corresponds to a module eigengene, column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend.
kegg.png: pathway enrichment analysis classification for the differentially expressed genes from the modules. Only the 10 most significant classifications are shown.
out_GO.png: Gene Ontology classification for the differentially expressed genes from the modules. Only the 10 most significant classifications are shown.
Functional annotation is also performed in this part. We select only the ten most significant annotations from the gene ontology and pathway enrichment analyses. The graphs generated from them can be downloaded to your machine and then analysed; they correspond to the kegg.png and out_GO.png files described above.