RNA-seq data analysis

RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. To review all the necessary tools and procedures please go through the following publication:

Conesa, A., Madrigal, P., Tarazona, S. et al. A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13 (2016). https://doi.org/10.1186/s13059-016-0881-8 https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

The basic workflow for RNA-seq data analysis is:

A caption

A caption

The minimum hardware requirment for running this pipeline on a human genome is:

16GB RAM

1 TB free space (This may differ based on the number of samples to be processed).

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. GEO requires raw data, processed data and metadata. Raw data facilitates the unambiguous interpretation of the data and potential verification of conclusions. For microarray data, raw data may be supplied either within the Sample record data tables or as external supplementary data files, e.g., Affymetrix CEL. For high-throughput sequencing, GEO brokers the complete set of raw data files, e.g., FASTQ, to the SRA database on your behalf.

One can download the raw data from the SRA database using sratoolkit.

https://github.com/ncbi/sra-tools

The command to download a file from GEO database in a fastq format is:

fastq-dump --split-files SRR10678267

(If the data is generated using a paired-end library)

fastq-dump SRR10678267

(If the data is generated using a single-end library)

Once the data is downloaded you can check the quality of the data using FastQC and perform quality control analysis using cutadapt. The tutorial for both of these tools are found here:

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/ (for FastQC)

https://cutadapt.readthedocs.io/en/stable/guide.html (for cutadapt)

Once the data is pre-processed and checked for quality measures, it can be aligned to the reference genome. For alignment and quantification you will require two files:

  1. Reference genome file (fasta format)

  2. Gene annotation file (GTF/GFF format)

When working with the human genome one can either download the hg19 or hg38 version. The most updated version of human genome is hg38 (GRCh38). The reference files can be downloaded from the following links:

https://www.ncbi.nlm.nih.gov/genome/?term=human (from NCBI)

https://www.gencodegenes.org/human/ (from gencode)

https://www.ensembl.org/info/data/ftp/index.html (from ensembl)

Though each of the of these databases have the same sequences for the human genome, the nomenclature of these sequences differ from database to database. For eg: GENCODE uses the UCSC convention of prefixing chromosome names with “chr”, e.g. “chr1” and “chrM”, but Ensembl calls these “1” or “MT”. At the time of writing (Ensembl 89), a few transcripts differ due to conversion issues. In addition, around 160 PAR genes are duplicated in GENCODE but only once in Ensembl. The differences affect fewer than 1% of the transcripts. Apart from gene annotation itself, the links to external databases differ.

To determine where on the human genome our reads originated from, one will align the reads to the reference genome using STAR (Spliced Transcripts Alignment to a Reference). STAR is an aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments. Please refer to the following link to understand the working of the STAR algorithm:

https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html

Before starting with the reads alignment, one needs to index the reference genome using the following command:

STAR --runThreadN n --runMode genomeGenerate --genomeDir dir --genomeFastaFiles reference --sjdbGTFfile gftFile

where n = number of threads/cores (This parameter depends on the number of cores available in your system).

dir = path to the folder where one wishes to store the index files

reference = reference genome file in FASTA format

gtfFile = gene annotation file in GTF/GFF format

Example:

STAR --runThreadN 32 --runMode genomeGenerate --genomeDir ./hg38/ --genomeFastaFiles GRCh38_latest_genomic.fna --sjdbGTFfile GRCh38_latest_genomic.gff

Remember that the indexed files need to be generated only once. This is not a repetitive step.

Once the indexing is done you can move on to the alignment step. The command for alignment is:

STAR --genomeDir genomeDir --runThreadN n --readFilesIn R1.fastq R2.fastq --outFileNamePrefix out --outSAMtype BAM SortedByCoordinate

where n = number of threads/cores (This parameter depends on the number of cores available in your system).

genomeDir = path to the folder where the index files are stored

R1/R2.fastq = input FASTQ files

out = prefix for all output files

outSAMtype = output filetype (BAM+sorted output suggested)

Example:

STAR --genomeDir ./hg38/ --runThreadN 32 --readFilesIn SRR10678267_1.fastq SRR10678267_2.fastq --outFileNamePrefix SRR10678267 --outSAMtype BAM SortedByCoordinate

The output BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data. These formats are described on the SAM Tools web site: http://samtools.github.io/hts-specs/.

It is also a good practice to index the output BAM file using the samtools command.

samtools index BAMFILE

Example:

samtools index SRR10678267Aligned.sortedByCoord.out.bam

Once the alignment is done, the next step is the quantification of the data. You can use bedtools coverage function for quantification of RNA-seq data using the command:

bedtools coverage -a gtfFile -b BAMFILE > out.gtf

where, gftFile = gene annotation file in GTF/GFF format and

BAMFILE = output BAM file from STAR alignment

Example:

bedtools coverage -a GRCh38_latest_genomic.gff -b SRR10678267Aligned.sortedByCoord.out.bam > SRR10678267.gtf

After each interval in A, bedtools coverage will report:

  1. The number of features in B that overlapped (by at least one base pair) the A interval.

  2. The number of bases in A that had non-zero coverage from features in B.

  3. The length of the entry in A.

  4. The fraction of bases in A that had non-zero coverage from features in B.

All these columns will be concatenated to the end of the GTF file for the sample. The 1st output column of bedtools (10th column from the output GTF file) is the gene count column. We require this column for further analysis.

Please remember to provide absolute paths for each of the files used in the commands above.

For identifying the differential expression genes (DEGs) one needs to have minimum of three samples per condition for analysis. You will need to run the above commands for all the samples that need to be included in the analysis.

Once the alignment and quantification process is complete for all the samples, expression matrix needs to be created for further analysis. For each output GTF file we will only extract the protein-coding gene entries using the grep function and extract only the 9th and 10th columns (gene annotation and gene count) using the cut function.

grep -P "\tgene\t" SRR10678267.gtf | grep "protein_coding" | cut -f 9,10 > SRR10678267.txt

You will run the same command for all the GTF files in turn generating the same number of TXT files.

Now to extract the gene names from the TXT file, we again use the cut command.

The first column of the TXT file should have entries like:

ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding

Here the first attribute is the gene name which we will extract using the following command:

cut -f 1 -d ";" SRR10678267.txt > genes

The -f option allows you to determine which column needs to be extracted. If the gene name information is in the second column, one can use the parameter -f 2 to extract the particular column.

You do not need to do this step for all the files, as the gene entries would be the same for all of them.

The further compilation script needs to be run in the R environment.

Please set the path to the folder where all the files are present.

#set the working directory using the function setwd())
setwd("/home/pranali")
#List all the TXT files in the directory
files  = list.files(".", pattern = "*.txt")

#Check that only the sample TXT are selected. If any other file is selected remove it from the files objects. 
files
## [1] "SRR10678284.txt" "SRR10678285.txt" "SRR10678286.txt" "SRR10678287.txt"
for(i in 1:length(files)){
  #read each of the TXT files
  data = read.delim(files[i], header=  F)
  if(i == 1){
    #create a object called mat for storing all the count data
    mat = as.data.frame(data[,2])
  } else {
    #concatenate the second column of each TXT file
    mat = cbind.data.frame(mat, data[,2])
  }
}

#set the column header as the file names, as each filename represents the sample ID
colnames(mat) = gsub(".txt", "", files)
head(mat)
##   SRR10678284 SRR10678285 SRR10678286 SRR10678287
## 1          49          20          86         127
## 2         309         977        1178         475
## 3          96          43         149          71
## 4         163         309         348         168
## 5         432         340         682         342
## 6        1980        2175        4210        2378
genes = read.delim("genes", header = F)
head(genes)
##                     V1
## 1        ID=gene-OR4F5
## 2 ID=gene-LOC112268260
## 3       ID=gene-OR4F29
## 4 ID=gene-LOC105378947
## 5       ID=gene-OR4F16
## 6       ID=gene-SAMD11
#replace the "ID=gene-" string with nothing to just keep the gene symbols
genes[,1] = gsub("ID=gene-", "", genes[,1])
head(genes)
##             V1
## 1        OR4F5
## 2 LOC112268260
## 3       OR4F29
## 4 LOC105378947
## 5       OR4F16
## 6       SAMD11
#save the mat object for the next snippet of averaging over multiple gene counts
mat2 = mat

#assign the gene symbols as row names to the mat object
rownames(mat) = genes[,1]
head(mat)
##              SRR10678284 SRR10678285 SRR10678286 SRR10678287
## OR4F5                 49          20          86         127
## LOC112268260         309         977        1178         475
## OR4F29                96          43         149          71
## LOC105378947         163         309         348         168
## OR4F16               432         340         682         342
## SAMD11              1980        2175        4210        2378
#write the matrix out to a file
write.csv(mat, file = "count.csv")

If you have duplicate gene entries in your data you can take average of the gene count for the duplicate entries. Please refer to the following snippet:

#concatenate the 
mat1 = cbind.data.frame(genes[,1], mat2)
mat1[1:5,1:5]
##     genes[, 1] SRR10678284 SRR10678285 SRR10678286 SRR10678287
## 1        OR4F5          49          20          86         127
## 2 LOC112268260         309         977        1178         475
## 3       OR4F29          96          43         149          71
## 4 LOC105378947         163         309         348         168
## 5       OR4F16         432         340         682         342
#hererwe take the average of all the features with the same symbols
mat1 = aggregate(mat1[, 2:ncol(mat1)], list(mat1[,1]), mean)

rownames(mat1) = mat1[,1]
mat1 = mat1[,-1]

head(mat1)
##         SRR10678284 SRR10678285 SRR10678286 SRR10678287
## A1BG            289         365        4383        7282
## A1CF             64         220         679          12
## A2M             138        1795         953         135
## A2ML1           292         322         586         186
## A3GALT2         386         294        1076         564
## A4GALT          260         145         332        1622
#save the mat1 object for further analysis.

Once the expression matrix is generated the final step in the differential expression analysis workflow is fitting the raw counts to the NB model and performing the statistical test for differentially expressed genes. In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.

For this analysis we will use the RNA-seq data obtained from the EBI Expression Atlas (GXA). Specifically data set E-GEOD-50760 which corresponds to PMID: 25049118. This data consists of 54 samples from 18 individuals. Each individual has a primary colorectal cancer sample, a metastatic liver sample, and a normal sample of the surrounding colonic epithilium. The quantification data required to run differential expression analysis using DEseq2 are raw readcounts for either genes or transcripts. We will use the output count matrix as a starting point.

The count data can be downloaded from: https://www.ebi.ac.uk/gxa/experiments-content/E-GEOD-50760/resources/DifferentialSecondaryDataFiles.RnaSeq/raw-counts

The sample information data can be downloaded from: https://www.ebi.ac.uk/gxa/experiments-content/E-GEOD-50760/resources/ExperimentDesignFile.RnaSeq/experiment-design

#set the working directory using the function setwd())
setwd("/home/pranali")

# Install the latest version of DEseq2
#Remember that you need to install the package only once but load it every time you start a new R session.
#if (!requireNamespace("BiocManager", quietly = TRUE))
 #   install.packages("BiocManager")
#BiocManager::install("DESeq2")

# load the library
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
# Read in the raw read counts
rawCounts = read.delim("E-GEOD-50760-raw-counts.tsv")
head(rawCounts)
##           Gene.ID Gene.Name SRR975562 SRR975555 SRR975597 SRR975571 SRR975553
## 1 ENSG00000000003    TSPAN6      1990       874      1011      1366       867
## 2 ENSG00000000005      TNMD        12         4         6        11        12
## 3 ENSG00000000419      DPM1       681       978       610       432       293
## 4 ENSG00000000457     SCYL3       165       221       138       309       207
## 5 ENSG00000000460  C1orf112        88       109        88        42        38
## 6 ENSG00000000938       FGR        60        84        81       121        89
##   SRR975567 SRR975590 SRR975581 SRR975601 SRR975556 SRR975574 SRR975566
## 1       619      1800       385      1523       754       764      2118
## 2         1         3         2         7         3         9         5
## 3       400       560       157       547       188       182       485
## 4       248       158       146       197       150       158        99
## 5       108       116        27        97        15        35        89
## 6        93        66        47        61        39        61       120
##   SRR975554 SRR975589 SRR975598 SRR975586 SRR975564 SRR975558 SRR975563
## 1      1957       669      3570      1145       264       759       998
## 2        13         1        24         1         1         1         2
## 3       646       455       716       353       403       374       372
## 4       119       253       232       144       136       247       299
## 5       119       179       158        41        96        88        45
## 6        65       413        38        43       216       137        55
##   SRR975557 SRR975578 SRR975585 SRR975569 SRR975572 SRR975551 SRR975593
## 1       116       594       672      1726      1324      3810      1457
## 2         6         4         5         9        10        45         8
## 3       149       145       283       456       277      1574       759
## 4        82       133       222       330       255       250       277
## 5        21        13        36        72        34       232        85
## 6       137        42        48        67        59        31       322
##   SRR975552 SRR975568 SRR975588 SRR975582 SRR975559 SRR975600 SRR975584
## 1       775      1501      1036       688      1202       401       575
## 2         1         2         6         9         3         1         5
## 3       404       669       435       264       588       345       247
## 4       357       174       329       199        97       138       177
## 5       188        92        61        34        94        81        39
## 6        59        53        52        58        66       342        56
##   SRR975595 SRR975579 SRR975577 SRR975591 SRR975592 SRR975583 SRR975599
## 1      1660       191       774       992       736      1019       411
## 2        10         7        13         2         4         3         8
## 3       477       116       206       643       479       244       223
## 4       131        87       214       187       100       223        62
## 5        56        13        30        92       100        52        42
## 6        60        33        50       372       112        43        66
##   SRR975603 SRR975575 SRR975596 SRR975565 SRR975604 SRR975594 SRR975560
## 1       974       521       968      2511      1144       566      1065
## 2         0         2         4        14         3         0         3
## 3       729       220       647       777       689       134       384
## 4       207       150        73       248       185        89       133
## 5       141        38        53       126        50        25        37
## 6       346        53        93        48       378        74        66
##   SRR975580 SRR975570 SRR975602 SRR975561 SRR975573 SRR975587 SRR975576
## 1      1596       394      1451       804       571      1825       783
## 2        17         1         0         4        13         2         2
## 3       444       157       560       301       229       717       192
## 4       178       133        85       147       182       165       146
## 5        66        20        84        21        37       138        19
## 6        59        23        38        54        44       124        46
# Read in the sample mappings
sampleData = read.delim("E-GEOD-50760-experiment-design.tsv")
head(sampleData)
##         Run Sample.Characteristic.biopsy.site.
## 1 SRR975551                      primary tumor
## 2 SRR975552                      primary tumor
## 3 SRR975553                      primary tumor
## 4 SRR975554                      primary tumor
## 5 SRR975555                      primary tumor
## 6 SRR975556                      primary tumor
##   Sample.Characteristic.Ontology.Term.biopsy.site.
## 1             http://www.ebi.ac.uk/efo/EFO_0000616
## 2             http://www.ebi.ac.uk/efo/EFO_0000616
## 3             http://www.ebi.ac.uk/efo/EFO_0000616
## 4             http://www.ebi.ac.uk/efo/EFO_0000616
## 5             http://www.ebi.ac.uk/efo/EFO_0000616
## 6             http://www.ebi.ac.uk/efo/EFO_0000616
##   Sample.Characteristic.disease. Sample.Characteristic.Ontology.Term.disease.
## 1              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 2              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 3              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 4              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 5              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 6              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
##   Sample.Characteristic.disease.staging.
## 1             Stage IV Colorectal Cancer
## 2             Stage IV Colorectal Cancer
## 3             Stage IV Colorectal Cancer
## 4             Stage IV Colorectal Cancer
## 5             Stage IV Colorectal Cancer
## 6             Stage IV Colorectal Cancer
##   Sample.Characteristic.Ontology.Term.disease.staging.
## 1                                                   NA
## 2                                                   NA
## 3                                                   NA
## 4                                                   NA
## 5                                                   NA
## 6                                                   NA
##   Sample.Characteristic.individual.
## 1                             AMC_2
## 2                             AMC_3
## 3                             AMC_5
## 4                             AMC_6
## 5                             AMC_7
## 6                             AMC_8
##   Sample.Characteristic.Ontology.Term.individual.
## 1                                              NA
## 2                                              NA
## 3                                              NA
## 4                                              NA
## 5                                              NA
## 6                                              NA
##   Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
## 1                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 2                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 3                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 4                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 5                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 6                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
##   Sample.Characteristic.organism.part.
## 1                                colon
## 2                                colon
## 3                                colon
## 4                                colon
## 5                                colon
## 6                                colon
##   Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
## 1      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 2      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 3      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 4      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 5      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 6      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
##   Factor.Value.Ontology.Term.biopsy.site. Analysed
## 1    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 2    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 3    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 4    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 5    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 6    http://www.ebi.ac.uk/efo/EFO_0000616      Yes

The first thing we do is coerce the data frame containing the read counts into a format DESeq2 can accept. Specifically this must be a matrix with row names as genomic features (i.e. genes), and column names as samples. Next DESeq2 requires a data frame specifying the mapping of samples to variables, we load this in and clean it up some keeping only the variables we care about and making sure everything is a factor. For DEseq2 to work properly the column names of the count matrix must be in the same order as the row names of the sample mapping data, to ensure this we re-order the column names of the count data and run a check to ensure this has occurred correctly. To take advantage of the default settings of DEseq2 the control of the variable of interest, in our case the tissue type, should be the first element in the levels of that variable. Because we have more than 2 conditions for this variable we will not be taking advantage of the default settings however it’s good to get into the practice of doing this so we do it here. We then create a DEseq2DataSet object with this information and supply a formula where we use the individual id as a blocking factor and tissue type as the comparison variable.

# Convert count data to a matrix of appropriate form that DEseq2 can read
geneID <- rawCounts$Gene.ID
sampleIndex <- grepl("SRR\\d+", colnames(rawCounts))
rawCounts <- as.matrix(rawCounts[,sampleIndex])
rownames(rawCounts) <- geneID
head(rawCounts)
##                 SRR975562 SRR975555 SRR975597 SRR975571 SRR975553 SRR975567
## ENSG00000000003      1990       874      1011      1366       867       619
## ENSG00000000005        12         4         6        11        12         1
## ENSG00000000419       681       978       610       432       293       400
## ENSG00000000457       165       221       138       309       207       248
## ENSG00000000460        88       109        88        42        38       108
## ENSG00000000938        60        84        81       121        89        93
##                 SRR975590 SRR975581 SRR975601 SRR975556 SRR975574 SRR975566
## ENSG00000000003      1800       385      1523       754       764      2118
## ENSG00000000005         3         2         7         3         9         5
## ENSG00000000419       560       157       547       188       182       485
## ENSG00000000457       158       146       197       150       158        99
## ENSG00000000460       116        27        97        15        35        89
## ENSG00000000938        66        47        61        39        61       120
##                 SRR975554 SRR975589 SRR975598 SRR975586 SRR975564 SRR975558
## ENSG00000000003      1957       669      3570      1145       264       759
## ENSG00000000005        13         1        24         1         1         1
## ENSG00000000419       646       455       716       353       403       374
## ENSG00000000457       119       253       232       144       136       247
## ENSG00000000460       119       179       158        41        96        88
## ENSG00000000938        65       413        38        43       216       137
##                 SRR975563 SRR975557 SRR975578 SRR975585 SRR975569 SRR975572
## ENSG00000000003       998       116       594       672      1726      1324
## ENSG00000000005         2         6         4         5         9        10
## ENSG00000000419       372       149       145       283       456       277
## ENSG00000000457       299        82       133       222       330       255
## ENSG00000000460        45        21        13        36        72        34
## ENSG00000000938        55       137        42        48        67        59
##                 SRR975551 SRR975593 SRR975552 SRR975568 SRR975588 SRR975582
## ENSG00000000003      3810      1457       775      1501      1036       688
## ENSG00000000005        45         8         1         2         6         9
## ENSG00000000419      1574       759       404       669       435       264
## ENSG00000000457       250       277       357       174       329       199
## ENSG00000000460       232        85       188        92        61        34
## ENSG00000000938        31       322        59        53        52        58
##                 SRR975559 SRR975600 SRR975584 SRR975595 SRR975579 SRR975577
## ENSG00000000003      1202       401       575      1660       191       774
## ENSG00000000005         3         1         5        10         7        13
## ENSG00000000419       588       345       247       477       116       206
## ENSG00000000457        97       138       177       131        87       214
## ENSG00000000460        94        81        39        56        13        30
## ENSG00000000938        66       342        56        60        33        50
##                 SRR975591 SRR975592 SRR975583 SRR975599 SRR975603 SRR975575
## ENSG00000000003       992       736      1019       411       974       521
## ENSG00000000005         2         4         3         8         0         2
## ENSG00000000419       643       479       244       223       729       220
## ENSG00000000457       187       100       223        62       207       150
## ENSG00000000460        92       100        52        42       141        38
## ENSG00000000938       372       112        43        66       346        53
##                 SRR975596 SRR975565 SRR975604 SRR975594 SRR975560 SRR975580
## ENSG00000000003       968      2511      1144       566      1065      1596
## ENSG00000000005         4        14         3         0         3        17
## ENSG00000000419       647       777       689       134       384       444
## ENSG00000000457        73       248       185        89       133       178
## ENSG00000000460        53       126        50        25        37        66
## ENSG00000000938        93        48       378        74        66        59
##                 SRR975570 SRR975602 SRR975561 SRR975573 SRR975587 SRR975576
## ENSG00000000003       394      1451       804       571      1825       783
## ENSG00000000005         1         0         4        13         2         2
## ENSG00000000419       157       560       301       229       717       192
## ENSG00000000457       133        85       147       182       165       146
## ENSG00000000460        20        84        21        37       138        19
## ENSG00000000938        23        38        54        44       124        46
# Convert sample variable mappings to an appropriate form that DESeq2 can read
head(sampleData)
##         Run Sample.Characteristic.biopsy.site.
## 1 SRR975551                      primary tumor
## 2 SRR975552                      primary tumor
## 3 SRR975553                      primary tumor
## 4 SRR975554                      primary tumor
## 5 SRR975555                      primary tumor
## 6 SRR975556                      primary tumor
##   Sample.Characteristic.Ontology.Term.biopsy.site.
## 1             http://www.ebi.ac.uk/efo/EFO_0000616
## 2             http://www.ebi.ac.uk/efo/EFO_0000616
## 3             http://www.ebi.ac.uk/efo/EFO_0000616
## 4             http://www.ebi.ac.uk/efo/EFO_0000616
## 5             http://www.ebi.ac.uk/efo/EFO_0000616
## 6             http://www.ebi.ac.uk/efo/EFO_0000616
##   Sample.Characteristic.disease. Sample.Characteristic.Ontology.Term.disease.
## 1              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 2              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 3              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 4              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 5              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
## 6              colorectal cancer         http://www.ebi.ac.uk/efo/EFO_0005842
##   Sample.Characteristic.disease.staging.
## 1             Stage IV Colorectal Cancer
## 2             Stage IV Colorectal Cancer
## 3             Stage IV Colorectal Cancer
## 4             Stage IV Colorectal Cancer
## 5             Stage IV Colorectal Cancer
## 6             Stage IV Colorectal Cancer
##   Sample.Characteristic.Ontology.Term.disease.staging.
## 1                                                   NA
## 2                                                   NA
## 3                                                   NA
## 4                                                   NA
## 5                                                   NA
## 6                                                   NA
##   Sample.Characteristic.individual.
## 1                             AMC_2
## 2                             AMC_3
## 3                             AMC_5
## 4                             AMC_6
## 5                             AMC_7
## 6                             AMC_8
##   Sample.Characteristic.Ontology.Term.individual.
## 1                                              NA
## 2                                              NA
## 3                                              NA
## 4                                              NA
## 5                                              NA
## 6                                              NA
##   Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
## 1                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 2                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 3                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 4                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 5                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 6                    Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
##   Sample.Characteristic.organism.part.
## 1                                colon
## 2                                colon
## 3                                colon
## 4                                colon
## 5                                colon
## 6                                colon
##   Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
## 1      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 2      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 3      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 4      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 5      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
## 6      http://purl.obolibrary.org/obo/UBERON_0001155             primary tumor
##   Factor.Value.Ontology.Term.biopsy.site. Analysed
## 1    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 2    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 3    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 4    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 5    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
## 6    http://www.ebi.ac.uk/efo/EFO_0000616      Yes
rownames(sampleData) <- sampleData$Run
keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
sampleData <- sampleData[,keep]
colnames(sampleData) <- c("tissueType", "individualID")
sampleData$individualID <- factor(sampleData$individualID)
head(sampleData)
##              tissueType individualID
## SRR975551 primary tumor        AMC_2
## SRR975552 primary tumor        AMC_3
## SRR975553 primary tumor        AMC_5
## SRR975554 primary tumor        AMC_6
## SRR975555 primary tumor        AMC_7
## SRR975556 primary tumor        AMC_8
# Put the columns of the count data in the same order as rows names of the sample mapping, then make sure it worked
rawCounts <- rawCounts[,unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
## [1] TRUE
# rename the tissue types. Always see to it that your control sample's name should occur first when checked the factors alphabetically
rename_tissues <- function(x){
  x <- switch(as.character(x), "normal"="normal-looking surrounding colonic epithelium", "primary tumor"="primary colorectal cancer",  "colorectal cancer metastatic in the liver"="metastatic colorectal cancer to the liver")
  return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))

# Order the tissue types so that it is sensible and make sure the control sample is first: normal sample -> primary tumor -> metastatic tumor
sampleData$tissueType <- factor(sampleData$tissueType, levels=c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))

# Create the DEseq2DataSet object
deseq2Data <- DESeqDataSetFromMatrix(countData=rawCounts, colData=sampleData, design= ~ individualID + tissueType)
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]

While it is not strictly necessary, it is good to do some preliminary filtering of the data before running the differential expression analysis. This will reduce the size of the DEseq2DataSet object and speed up the runtime of the algorithm. Here we are performing relatively minor filtering, requiring genes to have more than a sum total of 5 reads of support in all samples.

#this command will give you an idea about how many genes would be filtered upon setting the filtering condition
#here we filter the genes which show expression > 0 in less than five samples
dim(deseq2Data)
## [1] 58735    54
dim(deseq2Data[rowSums(counts(deseq2Data)) > 5, ])
## [1] 32467    54
# Perform pre-filtering of the data
deseq2Data <- deseq2Data[rowSums(counts(deseq2Data)) > 5, ]

The next step is to run the function DEseq() on our DESeq2 data set object. In this step the algorithm will perform the following:

  1. Estimation of size factors

  2. Estimation of dispersion

  3. Negative Binomial GLM fitting and Wald statistic.

deseq2Data <- DESeq(deseq2Data)
## estimating size factors
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
##   Note: levels of factors in the design contain characters other than
##   letters, numbers, '_' and '.'. It is recommended (but not required) to use
##   only letters, numbers, and delimiters '_' or '.', as these are safe characters
##   for column names in R. [This is a message, not a warning or an error]
## fitting model and testing

Finally we can extract the differential expression results with the results() function. When using this function we need to tell DESeq2 what comparison to make. This is only necessary if the design formula is multi-factorial or, as in our case, the variable in the design formula has more than 2 levels. This is done with the contrast parameter which takes a character vector of three elements giving the name of the factor of interest, the numerator (i.e. comparator), and the denominator (i.e. control).

Here we get the output for normal tissue vs primary tumor expression results and view a summary of results.

# Extract differential expression results
# For "tissueType" perform primary vs normal comparison
deseq2Results <- results(deseq2Data, contrast=c("tissueType", "primary colorectal cancer", "normal-looking surrounding colonic epithelium"))
res = as.matrix(deseq2Results)

The ‘res’ object has the following information of the analysis:

  1. baseMean = the average of the normalized counts taken over all samples

  2. log2FoldChange = log2 fold change between the groups. E.g. value 2 means that the expression has increased 4-fold

  3. lfcSE = standard error of the log2FoldChange estimate

  4. stat = Wald statistic

  5. pvalue = Wald test p-value

  6. padj = Benjamini-Hochberg adjusted p-value

Statistically the criteria for a gene to be differentially expressed is:

padj < 0.05 AND log2FoldChange > 1 (for upregulated genes)

padj < 0.05 AND log2FoldChange < -1 (for downregulated genes)

Next we convert the ensembl IDs to gene symbols for fuurther enrichment analysis

ensemblIDs = rownames(res)

# remember to install it if you don't have it already
#This is the database for Human Genome (Find a similar one for other model organisms)
library("org.Hs.eg.db")
## Loading required package: AnnotationDbi
## 
#map the ensembl IDs to gene symbols
symbols <- mapIds(org.Hs.eg.db, keys = ensemblIDs, keytype = "ENSEMBL", column="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
symbols = as.data.frame(symbols)
head(symbols)
##                  symbols
## ENSG00000000003   TSPAN6
## ENSG00000000005     TNMD
## ENSG00000000419     DPM1
## ENSG00000000457    SCYL3
## ENSG00000000460 C1orf112
## ENSG00000000938      FGR
#check if rownames of 'res' object and rownames of 'symbols' object are identical. If not order them alphabetically which makes the concatenation of two objects easy.
identical(rownames(symbols), rownames(res))
## [1] TRUE
res = cbind.data.frame(symbols[,1], res)
colnames(res)[1] = "GeneSymbol"
write.csv(res, file = "DESeqResults.csv")

We will used this data frame for further enrichment analysis using clusterprofiler. clusterProfiler implements methods to analyze and visualize functional profiles of genomic coordinates.