This material has been created using the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html
Packages used: limma, edgeR, gplots, org.Mm.eg.db, RColorBrewer, Glimma
There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analysis on in R. While mapping and counting are important and necessary tasks (using command line tools and packages), today we will be starting from the count data and getting into analysis.
Set up an RStudio project specifying the directory where you have saved the /data
directory.
First, let’s load all the packages we will need to analyse the data.
library(edgeR)
library(DESeq2)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
library(RColorBrewer)
library(GEOquery)
library(tibble)
library(tidyverse)
The data for this tutorial comes from a paper, Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation . Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE116030.
This study examines the expression profiles of wild type neonatal cardiomyocytes and SRF-overexpressed neonatal cardiomyocytes in mouse. Two groups are present and each group contains three biological replicates.
The sampleinfo file contains basic information about the samples that we will need for the analysis today.
# Read the sample information into R
# Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation.
gset <- getGEO("GSE116030", GSEMatrix =TRUE, getGPL=FALSE)
## Found 1 file(s)
## GSE116030_series_matrix.txt.gz
## Parsed with column specification:
## cols(
## ID_REF = col_character(),
## GSM3207724 = col_character(),
## GSM3207725 = col_character(),
## GSM3207726 = col_character(),
## GSM3207727 = col_character(),
## GSM3207728 = col_character(),
## GSM3207729 = col_character()
## )
sampleinfo = pData(gset[[1]])[, 1:3]
sampleinfo$status = factor(c(rep('OE',3), rep('GFP', 3)))
sampleinfo = as.data.frame(sampleinfo)
We will first use the counts file as a starting point for our analysis. This data has already been aligned to the mouse genome. Let’s take a look at the data.
# Read the data into R
seqdata = read.table('~/Desktop/lab/Shulin/data/published/GSE116030_SrfOE.txt', header = T, sep = '\t')
head(seqdata)
## Chr OE1 OE2 OE3 GFP1 GFP2 GFP3
## 1 Xkr4 3634 4 10 0 3 3
## 2 Rp1 9747 41 24 28 93 38
## 3 Sox17 3130 196 91 170 90 47
## 4 Mrpl15 4201 1395 1924 2057 2241 1410
## 5 Lypla1 2433 1732 2524 2855 3149 1350
## 6 Tcea1 2847 286 494 368 263 148
dim(seqdata)
## [1] 23337 7
Things become a little complicated data for RNA-sequencing data due to its enormous size (~20G per fastq file). Additionally, the method for accessing array data can’t be used in RNA-sequencing data directly (see the next example). Here we need to download the data directly from the GEO website and load it in R.
The seqdata
object contains information about genes (one gene per row), the first column has the gene symbol and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample.
We will be reformating the counts matrix into a suitable format for downstream analysis. The first column in the seqdata
dataframe contain annotation information. We need to make a new matrix containing only the counts, but we can store the gene identifiers (the GeneID
column) as rownames. We will add more annotation information about each gene later.
Let’s create a new data object, countdata
, that contains only the counts for the 6 samples.
# Remove first two columns from seqdata
countdata <- seqdata[,-1]
# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]
head(countdata)
## OE1 OE2 OE3 GFP1 GFP2 GFP3
## Xkr4 3634 4 10 0 3 3
## Rp1 9747 41 24 28 93 38
## Sox17 3130 196 91 170 90 47
## Mrpl15 4201 1395 1924 2057 2241 1410
## Lypla1 2433 1732 2524 2855 3149 1350
## Tcea1 2847 286 494 368 263 148
Now take a look at the column names
colnames(countdata)
## [1] "OE1" "OE2" "OE3" "GFP1" "GFP2" "GFP3"
These are the sample names which are pretty short and direct so we don’t need to modify the names. (For more complicated names, we usually need to shorten these to contain only the relevant information about each sample for later indexing use. We will use the substr
command to extract the needed characters and use these as the colnames.)
Note that the column names are now the same as title
in the sampleinfo
file. This is good because it means our sample information in sampleinfo
is in the same order as the columns in countdata
.
table(colnames(countdata)==sampleinfo$title)
##
## TRUE
## 6
Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.
There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in our case there is a sample size of 3 in each group, we prefer filtering on a minimum counts per million threshold present in at least 3 samples. Three represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least three samples.
We’ll use the cpm
function from the edgeR library [@robinson2010edgeR] to generate the CPM values and then filter.
# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)
## OE1 OE2 OE3 GFP1 GFP2 GFP3
## Xkr4 55.85028 0.2511356 0.4695154 0.000000 0.1295357 0.2122026
## Rp1 149.79984 2.5741395 1.1268368 1.347108 4.0156080 2.6878996
## Sox17 48.10439 12.3056423 4.2725897 8.178869 3.8860723 3.3245074
## Mrpl15 64.56439 87.5835253 90.3347541 98.964310 96.7631997 99.7352207
## Lypla1 37.39233 108.7416960 118.5056754 137.356881 135.9693512 95.4911687
## Tcea1 43.75502 17.9561923 23.1940585 17.704845 11.3559668 10.4686615
# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)
## OE1 OE2 OE3 GFP1 GFP2 GFP3
## Xkr4 TRUE FALSE FALSE FALSE FALSE FALSE
## Rp1 TRUE TRUE TRUE TRUE TRUE TRUE
## Sox17 TRUE TRUE TRUE TRUE TRUE TRUE
## Mrpl15 TRUE TRUE TRUE TRUE TRUE TRUE
## Lypla1 TRUE TRUE TRUE TRUE TRUE TRUE
## Tcea1 TRUE TRUE TRUE TRUE TRUE TRUE
# Summary of how many TRUEs there are in each row
# There are 11433 genes that have TRUEs in all 12 samples.
table(rowSums(thresh))
##
## 0 1 2 3 4 5 6
## 21 8733 898 531 531 475 12148
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
summary(keep)
## Mode FALSE TRUE
## logical 8754 14583
dim(counts.keep)
## [1] 14583 6
A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample.
# Let's have a look and see whether our threshold of 0.5 does indeed correspond to a count of about 10-15
# We will look at the first sample
plot(myCPM[,1],countdata[,1])
# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)
Next we’ll create a DGEList
object. This is an object used by edgeR to store count data. It has a number of slots for storing various parameters about the data.
dgeObj <- DGEList(counts.keep)
# have a look at dgeObj
dgeObj
## An object of class "DGEList"
## $counts
## OE1 OE2 OE3 GFP1 GFP2 GFP3
## Rp1 9747 41 24 28 93 38
## Sox17 3130 196 91 170 90 47
## Mrpl15 4201 1395 1924 2057 2241 1410
## Lypla1 2433 1732 2524 2855 3149 1350
## Tcea1 2847 286 494 368 263 148
## 14578 more rows ...
##
## $samples
## group lib.size norm.factors
## OE1 1 50280843 1
## OE2 1 15915232 1
## OE3 1 21287077 1
## GFP1 1 20776498 1
## GFP2 1 23149145 1
## GFP3 1 14128030 1
# See what slots are stored in dgeObj
names(dgeObj)
## [1] "counts" "samples"
# Library size information is stored in the samples slot
dgeObj$samples
## group lib.size norm.factors
## OE1 1 50280843 1
## OE2 1 15915232 1
## OE3 1 21287077 1
## GFP1 1 20776498 1
## GFP2 1 23149145 1
## GFP3 1 14128030 1
Now that we have got rid of the lowly expressed genes and have our counts stored in a DGEList
object, we can look at a few different plots to check that the data is good quality, and that the samples are as we would expect.
First, we can check how many reads we have for each sample in the dgeObj
.
dgeObj$samples$lib.size
## [1] 50280843 15915232 21287077 20776498 23149145 14128030
We can also plot the library sizes as a barplot to see whether there are any major discrepanies between the samples more easily.
# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(dgeObj$samples$lib.size, names=colnames(dgeObj), las=2)
# Add a title to the plot
title("Barplot of library sizes")
Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. Next we’ll use box plots to check the distribution of the read counts on the log2 scale. We can use the cpm
function to get log2 counts per million, which are corrected for the different library sizes. The cpm
function also adds a small offset to avoid taking log of zero.
# Get log2 counts per million
logcounts <- cpm(dgeObj,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
title("Boxplots of logCPMs (unnormalised)")
From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further. And in this case, we doubt that sample OE1 may indicate outlier sample.
An MDSplot is a visualisation of a principle components analysis. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the plotMDS
function to create the MDS plot.
# Redo the MDSplot with corrected information
col.cell <- c("purple","orange")[sampleinfo$status]
plotMDS(dgeObj,col=col.cell)
legend("topright",fill=c("purple","orange"),legend=levels(sampleinfo$status), pos)
title("Cell type")
Another alternative is to generate an interactive MDS plot using the Glimma package. This allows the user to interactively explore the different dimensions.
labels <- paste(sampleinfo$title, sampleinfo$status)
group <- sampleinfo$status
glMDSPlot(dgeObj, labels=labels, groups=group, folder="mds")
Glimma was created to make interactive versions of some of the popular plots from the limma package. At present it can be used to obtain MDS plots and mean-difference (MD) plots, which will be covered later. The output of glMDSPlot
is an html page (/mds/MDS-Plot.html) that shows the MDS plot on the left, and the amount of variation explained by each dimension in a barplot on the right. The user can hover over points to find out sample information, and switch between successive dimensions in the MDS plot by clicking on the bars in the barplot. The default MDS plots shows dimensions 1 and 2.
An alternative approach to plotMDS
for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the heatmap.2
function from the gplots package. In this example heatmap.2
calculates a matrix of euclidean distances from the logCPM (logcounts
object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using plotMDS
.)
Let’s select data for the 500 most variable genes and plot the heatmap
# We estimate the variance for each row in the logcounts matrix
var_genes <- apply(logcounts, 1, var)
head(var_genes)
## Rp1 Sox17 Mrpl15 Lypla1 Tcea1 Atp6v1h
## 7.37494829 2.48666024 0.01090041 0.31523921 0.76800042 0.09455902
# Get the gene names for the top 500 most variable genes
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
head(select_var)
## [1] "Ddx3y" "Eif2s3y" "Uty" "Kdm5d" "Xist" "Bmp10"
# Subset logcounts matrix
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)
## [1] 500 6
head(highly_variable_lcpm)
## OE1 OE2 OE3 GFP1 GFP2 GFP3
## Ddx3y 6.529260 -3.600278 6.6714104 6.747055 -3.600278 6.492581
## Eif2s3y 5.141785 -3.600278 6.5333585 6.744467 -3.600278 6.630892
## Uty 6.701250 -3.600278 4.7822478 5.063393 -3.600278 5.413526
## Kdm5d 6.766745 -3.600278 4.3388182 4.105778 -3.600278 4.379993
## Xist 8.477519 7.321857 -0.1826881 -1.096950 6.387398 -0.474964
## Bmp10 4.715637 5.396353 -3.6002782 2.044563 5.038692 -3.600278
## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[sampleinfo$status]
# Plot the heatmap
heatmap.2(highly_variable_lcpm,
col=rev(morecols(50)),
trace="column",
main="Top 500 most variable genes across samples",
ColSideColors=col.cell,scale="row")
# Save the heatmap
png(file="High_var_genes.heatmap.png")
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes\nacross samples",ColSideColors=col.cell,scale="row")
dev.off()
## png
## 2
Now let’s move to differentially expressed genes analysis
samples = sampleinfo[, c(1,3)]
row.names(samples) = sampleinfo$title
expr_srf = seqdata[keep,2:7]
rownames(expr_srf) = seqdata[keep,1]
design <- as.formula(~status)
design <- model.matrix(design, data = samples)
ddsObj <- DESeqDataSetFromMatrix(countData = expr_srf,
colData = samples,
design = design)
vstcounts <- vst(ddsObj, blind=TRUE)
#plotPCA(vstcounts, intgroup="status")
ddsObj <- estimateSizeFactors(ddsObj)
ddsObj <- estimateDispersions(ddsObj)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
ddsObj <- nbinomWaldTest(object = ddsObj,modelMatrix = design)
res <- results(ddsObj, alpha=0.05)
head(res)
## log2 fold change (MLE): statusOE
## Wald test p-value: statusOE
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## Rp1 284.427371843602 2.6549425449046 1.31661945144577
## Sox17 211.513148148464 0.9706681146393 0.714084763483921
## Mrpl15 2223.39610545563 -0.791015111594018 0.585443552013981
## Lypla1 2716.32563223496 -0.845126773895334 0.771272650615312
## Tcea1 418.813850191803 0.326381443546781 0.388814545068391
## Atp6v1h 1219.9405154697 -0.274435044471403 0.69353394773004
## stat pvalue padj
## <numeric> <numeric> <numeric>
## Rp1 2.01648436986802 NA NA
## Sox17 1.3593177788916 0.174045912645067 0.608497331931411
## Mrpl15 -1.35113813940362 0.176651186109257 0.609108424977331
## Lypla1 -1.09575618067243 0.27318548947567 0.668674721365126
## Tcea1 0.839427042240334 0.401229714145713 0.70617140501417
## Atp6v1h -0.395705279272396 0.692322460451111 0.834780863706197
topGenesPvV <- as.data.frame(res) %>%
rownames_to_column("GeneID") %>%
arrange(padj) %>%
head(1000)
topGenesPvV[which(topGenesPvV$padj<=0.05),1]
## [1] "Col27a1" "Col7a1" "Cd74" "Cnr1"
Usually we will perform the trimmed mean of M-values normalization method (TMM) to eliminate composition biases between libraries. Due to the time limit, please refer to other resources.
save(group,dgeObj,sampleinfo,file="~/Desktop/R_projects/data/preprocessing.Rdata")