RNA-seq data analysis
RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. To review all the necessary tools and procedures please go through the following publication:
Conesa, A., Madrigal, P., Tarazona, S. et al. A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13 (2016). https://doi.org/10.1186/s13059-016-0881-8 https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
The basic workflow for RNA-seq data analysis is:
A caption
The minimum hardware requirment for running this pipeline on a human genome is:
16GB RAM
1 TB free space (This may differ based on the number of samples to be processed).
GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. GEO requires raw data, processed data and metadata. Raw data facilitates the unambiguous interpretation of the data and potential verification of conclusions. For microarray data, raw data may be supplied either within the Sample record data tables or as external supplementary data files, e.g., Affymetrix CEL. For high-throughput sequencing, GEO brokers the complete set of raw data files, e.g., FASTQ, to the SRA database on your behalf.
One can download the raw data from the SRA database using sratoolkit.
https://github.com/ncbi/sra-tools
The command to download a file from GEO database in a fastq format is:
fastq-dump --split-files SRR10678267
(If the data is generated using a paired-end library)
fastq-dump SRR10678267
(If the data is generated using a single-end library)
Once the data is downloaded you can check the quality of the data using FastQC and perform quality control analysis using cutadapt. The tutorial for both of these tools are found here:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/ (for FastQC)
https://cutadapt.readthedocs.io/en/stable/guide.html (for cutadapt)
Once the data is pre-processed and checked for quality measures, it can be aligned to the reference genome. For alignment and quantification you will require two files:
Reference genome file (fasta format)
Gene annotation file (GTF/GFF format)
When working with the human genome one can either download the hg19 or hg38 version. The most updated version of human genome is hg38 (GRCh38). The reference files can be downloaded from the following links:
https://www.ncbi.nlm.nih.gov/genome/?term=human (from NCBI)
https://www.gencodegenes.org/human/ (from gencode)
https://www.ensembl.org/info/data/ftp/index.html (from ensembl)
Though each of the of these databases have the same sequences for the human genome, the nomenclature of these sequences differ from database to database. For eg: GENCODE uses the UCSC convention of prefixing chromosome names with “chr”, e.g. “chr1” and “chrM”, but Ensembl calls these “1” or “MT”. At the time of writing (Ensembl 89), a few transcripts differ due to conversion issues. In addition, around 160 PAR genes are duplicated in GENCODE but only once in Ensembl. The differences affect fewer than 1% of the transcripts. Apart from gene annotation itself, the links to external databases differ.
To determine where on the human genome our reads originated from, one will align the reads to the reference genome using STAR (Spliced Transcripts Alignment to a Reference). STAR is an aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments. Please refer to the following link to understand the working of the STAR algorithm:
https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html
Before starting with the reads alignment, one needs to index the reference genome using the following command:
STAR --runThreadN n --runMode genomeGenerate --genomeDir dir --genomeFastaFiles reference --sjdbGTFfile gftFile
where n = number of threads/cores (This parameter depends on the number of cores available in your system).
dir = path to the folder where one wishes to store the index files
reference = reference genome file in FASTA format
gtfFile = gene annotation file in GTF/GFF format
Example:
STAR --runThreadN 32 --runMode genomeGenerate --genomeDir ./hg38/ --genomeFastaFiles GRCh38_latest_genomic.fna --sjdbGTFfile GRCh38_latest_genomic.gff
Remember that the indexed files need to be generated only once. This is not a repetitive step.
Once the indexing is done you can move on to the alignment step. The command for alignment is:
STAR --genomeDir genomeDir --runThreadN n --readFilesIn R1.fastq R2.fastq --outFileNamePrefix out --outSAMtype BAM SortedByCoordinate
where n = number of threads/cores (This parameter depends on the number of cores available in your system).
genomeDir = path to the folder where the index files are stored
R1/R2.fastq = input FASTQ files
out = prefix for all output files
outSAMtype = output filetype (BAM+sorted output suggested)
Example:
STAR --genomeDir ./hg38/ --runThreadN 32 --readFilesIn SRR10678267_1.fastq SRR10678267_2.fastq --outFileNamePrefix SRR10678267 --outSAMtype BAM SortedByCoordinate
The output BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data. These formats are described on the SAM Tools web site: http://samtools.github.io/hts-specs/.
It is also a good practice to index the output BAM file using the samtools command.
samtools index BAMFILE
Example:
samtools index SRR10678267Aligned.sortedByCoord.out.bam
Once the alignment is done, the next step is the quantification of the data. You can use bedtools coverage function for quantification of RNA-seq data using the command:
bedtools coverage -a gtfFile -b BAMFILE > out.gtf
where, gftFile = gene annotation file in GTF/GFF format and
BAMFILE = output BAM file from STAR alignment
Example:
bedtools coverage -a GRCh38_latest_genomic.gff -b SRR10678267Aligned.sortedByCoord.out.bam > SRR10678267.gtf
After each interval in A, bedtools coverage will report:
The number of features in B that overlapped (by at least one base pair) the A interval.
The number of bases in A that had non-zero coverage from features in B.
The length of the entry in A.
The fraction of bases in A that had non-zero coverage from features in B.
All these columns will be concatenated to the end of the GTF file for the sample. The 1st output column of bedtools (10th column from the output GTF file) is the gene count column. We require this column for further analysis.
Please remember to provide absolute paths for each of the files used in the commands above.
For identifying the differential expression genes (DEGs) one needs to have minimum of three samples per condition for analysis. You will need to run the above commands for all the samples that need to be included in the analysis.
Once the alignment and quantification process is complete for all the samples, expression matrix needs to be created for further analysis. For each output GTF file we will only extract the protein-coding gene entries using the grep function and extract only the 9th and 10th columns (gene annotation and gene count) using the cut function.
grep -P "\tgene\t" SRR10678267.gtf | grep "protein_coding" | cut -f 9,10 > SRR10678267.txt
You will run the same command for all the GTF files in turn generating the same number of TXT files.
Now to extract the gene names from the TXT file, we again use the cut command.
The first column of the TXT file should have entries like:
ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding
Here the first attribute is the gene name which we will extract using the following command:
cut -f 1 -d ";" SRR10678267.txt > genes
The -f option allows you to determine which column needs to be extracted. If the gene name information is in the second column, one can use the parameter -f 2 to extract the particular column.
You do not need to do this step for all the files, as the gene entries would be the same for all of them.
The further compilation script needs to be run in the R environment.
Please set the path to the folder where all the files are present.
#set the working directory using the function setwd())
setwd("/home/pranali")
#List all the TXT files in the directory
files = list.files(".", pattern = "*.txt")
#Check that only the sample TXT are selected. If any other file is selected remove it from the files objects.
files
## [1] "SRR10678284.txt" "SRR10678285.txt" "SRR10678286.txt" "SRR10678287.txt"
for(i in 1:length(files)){
#read each of the TXT files
data = read.delim(files[i], header= F)
if(i == 1){
#create a object called mat for storing all the count data
mat = as.data.frame(data[,2])
} else {
#concatenate the second column of each TXT file
mat = cbind.data.frame(mat, data[,2])
}
}
#set the column header as the file names, as each filename represents the sample ID
colnames(mat) = gsub(".txt", "", files)
head(mat)
## SRR10678284 SRR10678285 SRR10678286 SRR10678287
## 1 49 20 86 127
## 2 309 977 1178 475
## 3 96 43 149 71
## 4 163 309 348 168
## 5 432 340 682 342
## 6 1980 2175 4210 2378
genes = read.delim("genes", header = F)
head(genes)
## V1
## 1 ID=gene-OR4F5
## 2 ID=gene-LOC112268260
## 3 ID=gene-OR4F29
## 4 ID=gene-LOC105378947
## 5 ID=gene-OR4F16
## 6 ID=gene-SAMD11
#replace the "ID=gene-" string with nothing to just keep the gene symbols
genes[,1] = gsub("ID=gene-", "", genes[,1])
head(genes)
## V1
## 1 OR4F5
## 2 LOC112268260
## 3 OR4F29
## 4 LOC105378947
## 5 OR4F16
## 6 SAMD11
#save the mat object for the next snippet of averaging over multiple gene counts
mat2 = mat
#assign the gene symbols as row names to the mat object
rownames(mat) = genes[,1]
head(mat)
## SRR10678284 SRR10678285 SRR10678286 SRR10678287
## OR4F5 49 20 86 127
## LOC112268260 309 977 1178 475
## OR4F29 96 43 149 71
## LOC105378947 163 309 348 168
## OR4F16 432 340 682 342
## SAMD11 1980 2175 4210 2378
#write the matrix out to a file
write.csv(mat, file = "count.csv")
If you have duplicate gene entries in your data you can take average of the gene count for the duplicate entries. Please refer to the following snippet:
#concatenate the
mat1 = cbind.data.frame(genes[,1], mat2)
mat1[1:5,1:5]
## genes[, 1] SRR10678284 SRR10678285 SRR10678286 SRR10678287
## 1 OR4F5 49 20 86 127
## 2 LOC112268260 309 977 1178 475
## 3 OR4F29 96 43 149 71
## 4 LOC105378947 163 309 348 168
## 5 OR4F16 432 340 682 342
#hererwe take the average of all the features with the same symbols
mat1 = aggregate(mat1[, 2:ncol(mat1)], list(mat1[,1]), mean)
rownames(mat1) = mat1[,1]
mat1 = mat1[,-1]
head(mat1)
## SRR10678284 SRR10678285 SRR10678286 SRR10678287
## A1BG 289 365 4383 7282
## A1CF 64 220 679 12
## A2M 138 1795 953 135
## A2ML1 292 322 586 186
## A3GALT2 386 294 1076 564
## A4GALT 260 145 332 1622
#save the mat1 object for further analysis.
Once the expression matrix is generated the final step in the differential expression analysis workflow is fitting the raw counts to the NB model and performing the statistical test for differentially expressed genes. In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.
For this analysis we will use the RNA-seq data obtained from the EBI Expression Atlas (GXA). Specifically data set E-GEOD-50760 which corresponds to PMID: 25049118. This data consists of 54 samples from 18 individuals. Each individual has a primary colorectal cancer sample, a metastatic liver sample, and a normal sample of the surrounding colonic epithilium. The quantification data required to run differential expression analysis using DEseq2 are raw readcounts for either genes or transcripts. We will use the output count matrix as a starting point.
The count data can be downloaded from: https://www.ebi.ac.uk/gxa/experiments-content/E-GEOD-50760/resources/DifferentialSecondaryDataFiles.RnaSeq/raw-counts
The sample information data can be downloaded from: https://www.ebi.ac.uk/gxa/experiments-content/E-GEOD-50760/resources/ExperimentDesignFile.RnaSeq/experiment-design
#set the working directory using the function setwd())
setwd("/home/pranali")
# Install the latest version of DEseq2
#Remember that you need to install the package only once but load it every time you start a new R session.
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("DESeq2")
# load the library
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
# Read in the raw read counts
rawCounts = read.delim("E-GEOD-50760-raw-counts.tsv")
head(rawCounts)
## Gene.ID Gene.Name SRR975562 SRR975555 SRR975597 SRR975571 SRR975553
## 1 ENSG00000000003 TSPAN6 1990 874 1011 1366 867
## 2 ENSG00000000005 TNMD 12 4 6 11 12
## 3 ENSG00000000419 DPM1 681 978 610 432 293
## 4 ENSG00000000457 SCYL3 165 221 138 309 207
## 5 ENSG00000000460 C1orf112 88 109 88 42 38
## 6 ENSG00000000938 FGR 60 84 81 121 89
## SRR975567 SRR975590 SRR975581 SRR975601 SRR975556 SRR975574 SRR975566
## 1 619 1800 385 1523 754 764 2118
## 2 1 3 2 7 3 9 5
## 3 400 560 157 547 188 182 485
## 4 248 158 146 197 150 158 99
## 5 108 116 27 97 15 35 89
## 6 93 66 47 61 39 61 120
## SRR975554 SRR975589 SRR975598 SRR975586 SRR975564 SRR975558 SRR975563
## 1 1957 669 3570 1145 264 759 998
## 2 13 1 24 1 1 1 2
## 3 646 455 716 353 403 374 372
## 4 119 253 232 144 136 247 299
## 5 119 179 158 41 96 88 45
## 6 65 413 38 43 216 137 55
## SRR975557 SRR975578 SRR975585 SRR975569 SRR975572 SRR975551 SRR975593
## 1 116 594 672 1726 1324 3810 1457
## 2 6 4 5 9 10 45 8
## 3 149 145 283 456 277 1574 759
## 4 82 133 222 330 255 250 277
## 5 21 13 36 72 34 232 85
## 6 137 42 48 67 59 31 322
## SRR975552 SRR975568 SRR975588 SRR975582 SRR975559 SRR975600 SRR975584
## 1 775 1501 1036 688 1202 401 575
## 2 1 2 6 9 3 1 5
## 3 404 669 435 264 588 345 247
## 4 357 174 329 199 97 138 177
## 5 188 92 61 34 94 81 39
## 6 59 53 52 58 66 342 56
## SRR975595 SRR975579 SRR975577 SRR975591 SRR975592 SRR975583 SRR975599
## 1 1660 191 774 992 736 1019 411
## 2 10 7 13 2 4 3 8
## 3 477 116 206 643 479 244 223
## 4 131 87 214 187 100 223 62
## 5 56 13 30 92 100 52 42
## 6 60 33 50 372 112 43 66
## SRR975603 SRR975575 SRR975596 SRR975565 SRR975604 SRR975594 SRR975560
## 1 974 521 968 2511 1144 566 1065
## 2 0 2 4 14 3 0 3
## 3 729 220 647 777 689 134 384
## 4 207 150 73 248 185 89 133
## 5 141 38 53 126 50 25 37
## 6 346 53 93 48 378 74 66
## SRR975580 SRR975570 SRR975602 SRR975561 SRR975573 SRR975587 SRR975576
## 1 1596 394 1451 804 571 1825 783
## 2 17 1 0 4 13 2 2
## 3 444 157 560 301 229 717 192
## 4 178 133 85 147 182 165 146
## 5 66 20 84 21 37 138 19
## 6 59 23 38 54 44 124 46
# Read in the sample mappings
sampleData = read.delim("E-GEOD-50760-experiment-design.tsv")
head(sampleData)
## Run Sample.Characteristic.biopsy.site.
## 1 SRR975551 primary tumor
## 2 SRR975552 primary tumor
## 3 SRR975553 primary tumor
## 4 SRR975554 primary tumor
## 5 SRR975555 primary tumor
## 6 SRR975556 primary tumor
## Sample.Characteristic.Ontology.Term.biopsy.site.
## 1 http://www.ebi.ac.uk/efo/EFO_0000616
## 2 http://www.ebi.ac.uk/efo/EFO_0000616
## 3 http://www.ebi.ac.uk/efo/EFO_0000616
## 4 http://www.ebi.ac.uk/efo/EFO_0000616
## 5 http://www.ebi.ac.uk/efo/EFO_0000616
## 6 http://www.ebi.ac.uk/efo/EFO_0000616
## Sample.Characteristic.disease. Sample.Characteristic.Ontology.Term.disease.
## 1 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 2 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 3 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 4 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 5 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 6 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## Sample.Characteristic.disease.staging.
## 1 Stage IV Colorectal Cancer
## 2 Stage IV Colorectal Cancer
## 3 Stage IV Colorectal Cancer
## 4 Stage IV Colorectal Cancer
## 5 Stage IV Colorectal Cancer
## 6 Stage IV Colorectal Cancer
## Sample.Characteristic.Ontology.Term.disease.staging.
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Sample.Characteristic.individual.
## 1 AMC_2
## 2 AMC_3
## 3 AMC_5
## 4 AMC_6
## 5 AMC_7
## 6 AMC_8
## Sample.Characteristic.Ontology.Term.individual.
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
## 1 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 2 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 3 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 4 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 5 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 6 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## Sample.Characteristic.organism.part.
## 1 colon
## 2 colon
## 3 colon
## 4 colon
## 5 colon
## 6 colon
## Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
## 1 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 2 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 3 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 4 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 5 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 6 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## Factor.Value.Ontology.Term.biopsy.site. Analysed
## 1 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 2 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 3 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 4 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 5 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 6 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
The first thing we do is coerce the data frame containing the read counts into a format DESeq2 can accept. Specifically this must be a matrix with row names as genomic features (i.e. genes), and column names as samples. Next DESeq2 requires a data frame specifying the mapping of samples to variables, we load this in and clean it up some keeping only the variables we care about and making sure everything is a factor. For DEseq2 to work properly the column names of the count matrix must be in the same order as the row names of the sample mapping data, to ensure this we re-order the column names of the count data and run a check to ensure this has occurred correctly. To take advantage of the default settings of DEseq2 the control of the variable of interest, in our case the tissue type, should be the first element in the levels of that variable. Because we have more than 2 conditions for this variable we will not be taking advantage of the default settings however it’s good to get into the practice of doing this so we do it here. We then create a DEseq2DataSet object with this information and supply a formula where we use the individual id as a blocking factor and tissue type as the comparison variable.
# Convert count data to a matrix of appropriate form that DEseq2 can read
geneID <- rawCounts$Gene.ID
sampleIndex <- grepl("SRR\\d+", colnames(rawCounts))
rawCounts <- as.matrix(rawCounts[,sampleIndex])
rownames(rawCounts) <- geneID
head(rawCounts)
## SRR975562 SRR975555 SRR975597 SRR975571 SRR975553 SRR975567
## ENSG00000000003 1990 874 1011 1366 867 619
## ENSG00000000005 12 4 6 11 12 1
## ENSG00000000419 681 978 610 432 293 400
## ENSG00000000457 165 221 138 309 207 248
## ENSG00000000460 88 109 88 42 38 108
## ENSG00000000938 60 84 81 121 89 93
## SRR975590 SRR975581 SRR975601 SRR975556 SRR975574 SRR975566
## ENSG00000000003 1800 385 1523 754 764 2118
## ENSG00000000005 3 2 7 3 9 5
## ENSG00000000419 560 157 547 188 182 485
## ENSG00000000457 158 146 197 150 158 99
## ENSG00000000460 116 27 97 15 35 89
## ENSG00000000938 66 47 61 39 61 120
## SRR975554 SRR975589 SRR975598 SRR975586 SRR975564 SRR975558
## ENSG00000000003 1957 669 3570 1145 264 759
## ENSG00000000005 13 1 24 1 1 1
## ENSG00000000419 646 455 716 353 403 374
## ENSG00000000457 119 253 232 144 136 247
## ENSG00000000460 119 179 158 41 96 88
## ENSG00000000938 65 413 38 43 216 137
## SRR975563 SRR975557 SRR975578 SRR975585 SRR975569 SRR975572
## ENSG00000000003 998 116 594 672 1726 1324
## ENSG00000000005 2 6 4 5 9 10
## ENSG00000000419 372 149 145 283 456 277
## ENSG00000000457 299 82 133 222 330 255
## ENSG00000000460 45 21 13 36 72 34
## ENSG00000000938 55 137 42 48 67 59
## SRR975551 SRR975593 SRR975552 SRR975568 SRR975588 SRR975582
## ENSG00000000003 3810 1457 775 1501 1036 688
## ENSG00000000005 45 8 1 2 6 9
## ENSG00000000419 1574 759 404 669 435 264
## ENSG00000000457 250 277 357 174 329 199
## ENSG00000000460 232 85 188 92 61 34
## ENSG00000000938 31 322 59 53 52 58
## SRR975559 SRR975600 SRR975584 SRR975595 SRR975579 SRR975577
## ENSG00000000003 1202 401 575 1660 191 774
## ENSG00000000005 3 1 5 10 7 13
## ENSG00000000419 588 345 247 477 116 206
## ENSG00000000457 97 138 177 131 87 214
## ENSG00000000460 94 81 39 56 13 30
## ENSG00000000938 66 342 56 60 33 50
## SRR975591 SRR975592 SRR975583 SRR975599 SRR975603 SRR975575
## ENSG00000000003 992 736 1019 411 974 521
## ENSG00000000005 2 4 3 8 0 2
## ENSG00000000419 643 479 244 223 729 220
## ENSG00000000457 187 100 223 62 207 150
## ENSG00000000460 92 100 52 42 141 38
## ENSG00000000938 372 112 43 66 346 53
## SRR975596 SRR975565 SRR975604 SRR975594 SRR975560 SRR975580
## ENSG00000000003 968 2511 1144 566 1065 1596
## ENSG00000000005 4 14 3 0 3 17
## ENSG00000000419 647 777 689 134 384 444
## ENSG00000000457 73 248 185 89 133 178
## ENSG00000000460 53 126 50 25 37 66
## ENSG00000000938 93 48 378 74 66 59
## SRR975570 SRR975602 SRR975561 SRR975573 SRR975587 SRR975576
## ENSG00000000003 394 1451 804 571 1825 783
## ENSG00000000005 1 0 4 13 2 2
## ENSG00000000419 157 560 301 229 717 192
## ENSG00000000457 133 85 147 182 165 146
## ENSG00000000460 20 84 21 37 138 19
## ENSG00000000938 23 38 54 44 124 46
# Convert sample variable mappings to an appropriate form that DESeq2 can read
head(sampleData)
## Run Sample.Characteristic.biopsy.site.
## 1 SRR975551 primary tumor
## 2 SRR975552 primary tumor
## 3 SRR975553 primary tumor
## 4 SRR975554 primary tumor
## 5 SRR975555 primary tumor
## 6 SRR975556 primary tumor
## Sample.Characteristic.Ontology.Term.biopsy.site.
## 1 http://www.ebi.ac.uk/efo/EFO_0000616
## 2 http://www.ebi.ac.uk/efo/EFO_0000616
## 3 http://www.ebi.ac.uk/efo/EFO_0000616
## 4 http://www.ebi.ac.uk/efo/EFO_0000616
## 5 http://www.ebi.ac.uk/efo/EFO_0000616
## 6 http://www.ebi.ac.uk/efo/EFO_0000616
## Sample.Characteristic.disease. Sample.Characteristic.Ontology.Term.disease.
## 1 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 2 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 3 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 4 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 5 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## 6 colorectal cancer http://www.ebi.ac.uk/efo/EFO_0005842
## Sample.Characteristic.disease.staging.
## 1 Stage IV Colorectal Cancer
## 2 Stage IV Colorectal Cancer
## 3 Stage IV Colorectal Cancer
## 4 Stage IV Colorectal Cancer
## 5 Stage IV Colorectal Cancer
## 6 Stage IV Colorectal Cancer
## Sample.Characteristic.Ontology.Term.disease.staging.
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Sample.Characteristic.individual.
## 1 AMC_2
## 2 AMC_3
## 3 AMC_5
## 4 AMC_6
## 5 AMC_7
## 6 AMC_8
## Sample.Characteristic.Ontology.Term.individual.
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Sample.Characteristic.organism. Sample.Characteristic.Ontology.Term.organism.
## 1 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 2 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 3 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 4 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 5 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## 6 Homo sapiens http://purl.obolibrary.org/obo/NCBITaxon_9606
## Sample.Characteristic.organism.part.
## 1 colon
## 2 colon
## 3 colon
## 4 colon
## 5 colon
## 6 colon
## Sample.Characteristic.Ontology.Term.organism.part. Factor.Value.biopsy.site.
## 1 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 2 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 3 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 4 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 5 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## 6 http://purl.obolibrary.org/obo/UBERON_0001155 primary tumor
## Factor.Value.Ontology.Term.biopsy.site. Analysed
## 1 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 2 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 3 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 4 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 5 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
## 6 http://www.ebi.ac.uk/efo/EFO_0000616 Yes
rownames(sampleData) <- sampleData$Run
keep <- c("Sample.Characteristic.biopsy.site.", "Sample.Characteristic.individual.")
sampleData <- sampleData[,keep]
colnames(sampleData) <- c("tissueType", "individualID")
sampleData$individualID <- factor(sampleData$individualID)
head(sampleData)
## tissueType individualID
## SRR975551 primary tumor AMC_2
## SRR975552 primary tumor AMC_3
## SRR975553 primary tumor AMC_5
## SRR975554 primary tumor AMC_6
## SRR975555 primary tumor AMC_7
## SRR975556 primary tumor AMC_8
# Put the columns of the count data in the same order as rows names of the sample mapping, then make sure it worked
rawCounts <- rawCounts[,unique(rownames(sampleData))]
all(colnames(rawCounts) == rownames(sampleData))
## [1] TRUE
# rename the tissue types. Always see to it that your control sample's name should occur first when checked the factors alphabetically
rename_tissues <- function(x){
x <- switch(as.character(x), "normal"="normal-looking surrounding colonic epithelium", "primary tumor"="primary colorectal cancer", "colorectal cancer metastatic in the liver"="metastatic colorectal cancer to the liver")
return(x)
}
sampleData$tissueType <- unlist(lapply(sampleData$tissueType, rename_tissues))
# Order the tissue types so that it is sensible and make sure the control sample is first: normal sample -> primary tumor -> metastatic tumor
sampleData$tissueType <- factor(sampleData$tissueType, levels=c("normal-looking surrounding colonic epithelium", "primary colorectal cancer", "metastatic colorectal cancer to the liver"))
# Create the DEseq2DataSet object
deseq2Data <- DESeqDataSetFromMatrix(countData=rawCounts, colData=sampleData, design= ~ individualID + tissueType)
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
While it is not strictly necessary, it is good to do some preliminary filtering of the data before running the differential expression analysis. This will reduce the size of the DEseq2DataSet object and speed up the runtime of the algorithm. Here we are performing relatively minor filtering, requiring genes to have more than a sum total of 5 reads of support in all samples.
#this command will give you an idea about how many genes would be filtered upon setting the filtering condition
#here we filter the genes which show expression > 0 in less than five samples
dim(deseq2Data)
## [1] 58735 54
dim(deseq2Data[rowSums(counts(deseq2Data)) > 5, ])
## [1] 32467 54
# Perform pre-filtering of the data
deseq2Data <- deseq2Data[rowSums(counts(deseq2Data)) > 5, ]
The next step is to run the function DEseq() on our DESeq2 data set object. In this step the algorithm will perform the following:
Estimation of size factors
Estimation of dispersion
Negative Binomial GLM fitting and Wald statistic.
deseq2Data <- DESeq(deseq2Data)
## estimating size factors
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## final dispersion estimates
## Note: levels of factors in the design contain characters other than
## letters, numbers, '_' and '.'. It is recommended (but not required) to use
## only letters, numbers, and delimiters '_' or '.', as these are safe characters
## for column names in R. [This is a message, not a warning or an error]
## fitting model and testing
Finally we can extract the differential expression results with the results() function. When using this function we need to tell DESeq2 what comparison to make. This is only necessary if the design formula is multi-factorial or, as in our case, the variable in the design formula has more than 2 levels. This is done with the contrast parameter which takes a character vector of three elements giving the name of the factor of interest, the numerator (i.e. comparator), and the denominator (i.e. control).
Here we get the output for normal tissue vs primary tumor expression results and view a summary of results.
# Extract differential expression results
# For "tissueType" perform primary vs normal comparison
deseq2Results <- results(deseq2Data, contrast=c("tissueType", "primary colorectal cancer", "normal-looking surrounding colonic epithelium"))
res = as.matrix(deseq2Results)
The ‘res’ object has the following information of the analysis:
baseMean = the average of the normalized counts taken over all samples
log2FoldChange = log2 fold change between the groups. E.g. value 2 means that the expression has increased 4-fold
lfcSE = standard error of the log2FoldChange estimate
stat = Wald statistic
pvalue = Wald test p-value
padj = Benjamini-Hochberg adjusted p-value
Statistically the criteria for a gene to be differentially expressed is:
padj < 0.05 AND log2FoldChange > 1 (for upregulated genes)
padj < 0.05 AND log2FoldChange < -1 (for downregulated genes)
Next we convert the ensembl IDs to gene symbols for fuurther enrichment analysis
ensemblIDs = rownames(res)
# remember to install it if you don't have it already
#This is the database for Human Genome (Find a similar one for other model organisms)
library("org.Hs.eg.db")
## Loading required package: AnnotationDbi
##
#map the ensembl IDs to gene symbols
symbols <- mapIds(org.Hs.eg.db, keys = ensemblIDs, keytype = "ENSEMBL", column="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
symbols = as.data.frame(symbols)
head(symbols)
## symbols
## ENSG00000000003 TSPAN6
## ENSG00000000005 TNMD
## ENSG00000000419 DPM1
## ENSG00000000457 SCYL3
## ENSG00000000460 C1orf112
## ENSG00000000938 FGR
#check if rownames of 'res' object and rownames of 'symbols' object are identical. If not order them alphabetically which makes the concatenation of two objects easy.
identical(rownames(symbols), rownames(res))
## [1] TRUE
res = cbind.data.frame(symbols[,1], res)
colnames(res)[1] = "GeneSymbol"
write.csv(res, file = "DESeqResults.csv")
We will used this data frame for further enrichment analysis using clusterprofiler. clusterProfiler implements methods to analyze and visualize functional profiles of genomic coordinates.