RNA-seq analysis in R

Resources

This material has been created using the following resources:
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html

Packages used: limma, edgeR, gplots, org.Mm.eg.db, RColorBrewer, Glimma

Overview

Reading in table of counts
Filtering lowly expressed genes
Quality control
Normalisation for composition bias
Differentially expressed gene analysis

Introduction

There are many steps involved in analysing an RNA-Seq experiment. Analysing an RNAseq experiment begins with sequencing reads. These are aligned to a reference genome, then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analysis on in R. While mapping and counting are important and necessary tasks (using command line tools and packages), today we will be starting from the count data and getting into analysis.

Data import

Set up an RStudio project specifying the directory where you have saved the /data directory.

First, let’s load all the packages we will need to analyse the data.

library(edgeR)
library(DESeq2)
library(limma)
library(Glimma)
library(gplots)
library(org.Mm.eg.db)
library(RColorBrewer)
library(GEOquery)
library(tibble)
library(tidyverse)

Mouse mammary gland dataset

The data for this tutorial comes from a paper, Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation . Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE116030.

This study examines the expression profiles of wild type neonatal cardiomyocytes and SRF-overexpressed neonatal cardiomyocytes in mouse. Two groups are present and each group contains three biological replicates.

The sampleinfo file contains basic information about the samples that we will need for the analysis today.

# Read the sample information into R
# Next Generation Sequencing Quantitative Analysis of Srf-regulated Transcriptomes in mouse cardiomyocytes during cardiomyocyte maturation.
gset <- getGEO("GSE116030", GSEMatrix =TRUE, getGPL=FALSE)

## Found 1 file(s)

## GSE116030_series_matrix.txt.gz

## Parsed with column specification:
## cols(
##   ID_REF = col_character(),
##   GSM3207724 = col_character(),
##   GSM3207725 = col_character(),
##   GSM3207726 = col_character(),
##   GSM3207727 = col_character(),
##   GSM3207728 = col_character(),
##   GSM3207729 = col_character()
## )

sampleinfo = pData(gset[[1]])[, 1:3]
sampleinfo$status = factor(c(rep('OE',3), rep('GFP', 3)))
sampleinfo = as.data.frame(sampleinfo)

Reading in the count data

We will first use the counts file as a starting point for our analysis. This data has already been aligned to the mouse genome. Let’s take a look at the data.

# Read the data into R
seqdata = read.table('~/Desktop/lab/Shulin/data/published/GSE116030_SrfOE.txt', header = T, sep = '\t')
head(seqdata)

##      Chr  OE1  OE2  OE3 GFP1 GFP2 GFP3
## 1   Xkr4 3634    4   10    0    3    3
## 2    Rp1 9747   41   24   28   93   38
## 3  Sox17 3130  196   91  170   90   47
## 4 Mrpl15 4201 1395 1924 2057 2241 1410
## 5 Lypla1 2433 1732 2524 2855 3149 1350
## 6  Tcea1 2847  286  494  368  263  148

dim(seqdata)

## [1] 23337     7

Things become a little complicated data for RNA-sequencing data due to its enormous size (~20G per fastq file). Additionally, the method for accessing array data can’t be used in RNA-sequencing data directly (see the next example). Here we need to download the data directly from the GEO website and load it in R.

The seqdata object contains information about genes (one gene per row), the first column has the gene symbol and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample.

Format the data

We will be reformating the counts matrix into a suitable format for downstream analysis. The first column in the seqdata dataframe contain annotation information. We need to make a new matrix containing only the counts, but we can store the gene identifiers (the GeneID column) as rownames. We will add more annotation information about each gene later.

Let’s create a new data object, countdata, that contains only the counts for the 6 samples.

# Remove first two columns from seqdata
countdata <- seqdata[,-1]

# Store EntrezGeneID as rownames
rownames(countdata) <- seqdata[,1]

head(countdata)

##         OE1  OE2  OE3 GFP1 GFP2 GFP3
## Xkr4   3634    4   10    0    3    3
## Rp1    9747   41   24   28   93   38
## Sox17  3130  196   91  170   90   47
## Mrpl15 4201 1395 1924 2057 2241 1410
## Lypla1 2433 1732 2524 2855 3149 1350
## Tcea1  2847  286  494  368  263  148

Now take a look at the column names

colnames(countdata)

## [1] "OE1"  "OE2"  "OE3"  "GFP1" "GFP2" "GFP3"

These are the sample names which are pretty short and direct so we don’t need to modify the names. (For more complicated names, we usually need to shorten these to contain only the relevant information about each sample for later indexing use. We will use the substr command to extract the needed characters and use these as the colnames.)

Note that the column names are now the same as title in the sampleinfo file. This is good because it means our sample information in sampleinfo is in the same order as the columns in countdata.

table(colnames(countdata)==sampleinfo$title)

## 
## TRUE 
##    6

Filtering to remove lowly expressed genes

Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in our case there is a sample size of 3 in each group, we prefer filtering on a minimum counts per million threshold present in at least 3 samples. Three represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least three samples.

We’ll use the cpm function from the edgeR library [@robinson2010edgeR] to generate the CPM values and then filter.

# Obtain CPMs
myCPM <- cpm(countdata)
# Have a look at the output
head(myCPM)

##              OE1         OE2         OE3       GFP1        GFP2       GFP3
## Xkr4    55.85028   0.2511356   0.4695154   0.000000   0.1295357  0.2122026
## Rp1    149.79984   2.5741395   1.1268368   1.347108   4.0156080  2.6878996
## Sox17   48.10439  12.3056423   4.2725897   8.178869   3.8860723  3.3245074
## Mrpl15  64.56439  87.5835253  90.3347541  98.964310  96.7631997 99.7352207
## Lypla1  37.39233 108.7416960 118.5056754 137.356881 135.9693512 95.4911687
## Tcea1   43.75502  17.9561923  23.1940585  17.704845  11.3559668 10.4686615

# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5
# This produces a logical matrix with TRUEs and FALSEs
head(thresh)

##         OE1   OE2   OE3  GFP1  GFP2  GFP3
## Xkr4   TRUE FALSE FALSE FALSE FALSE FALSE
## Rp1    TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Sox17  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Mrpl15 TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Lypla1 TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## Tcea1  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# Summary of how many TRUEs there are in each row
# There are 11433 genes that have TRUEs in all 12 samples.
table(rowSums(thresh))

## 
##     0     1     2     3     4     5     6 
##    21  8733   898   531   531   475 12148

# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2
# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- countdata[keep,]
summary(keep)

##    Mode   FALSE    TRUE 
## logical    8754   14583

dim(counts.keep)

## [1] 14583     6

A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample.

# Let's have a look and see whether our threshold of 0.5 does indeed correspond to a count of about 10-15
# We will look at the first sample
plot(myCPM[,1],countdata[,1])

# Let us limit the x and y-axis so we can actually look to see what is happening at the smaller counts
plot(myCPM[,1],countdata[,1],ylim=c(0,50),xlim=c(0,3))
# Add a vertical line at 0.5 CPM
abline(v=0.5)

Convert counts to DGEList object

Next we’ll create a DGEList object. This is an object used by edgeR to store count data. It has a number of slots for storing various parameters about the data.

dgeObj <- DGEList(counts.keep)
# have a look at dgeObj
dgeObj

## An object of class "DGEList"
## $counts
##         OE1  OE2  OE3 GFP1 GFP2 GFP3
## Rp1    9747   41   24   28   93   38
## Sox17  3130  196   91  170   90   47
## Mrpl15 4201 1395 1924 2057 2241 1410
## Lypla1 2433 1732 2524 2855 3149 1350
## Tcea1  2847  286  494  368  263  148
## 14578 more rows ...
## 
## $samples
##      group lib.size norm.factors
## OE1      1 50280843            1
## OE2      1 15915232            1
## OE3      1 21287077            1
## GFP1     1 20776498            1
## GFP2     1 23149145            1
## GFP3     1 14128030            1

# See what slots are stored in dgeObj
names(dgeObj)

## [1] "counts"  "samples"

# Library size information is stored in the samples slot
dgeObj$samples

##      group lib.size norm.factors
## OE1      1 50280843            1
## OE2      1 15915232            1
## OE3      1 21287077            1
## GFP1     1 20776498            1
## GFP2     1 23149145            1
## GFP3     1 14128030            1

Quality control

Now that we have got rid of the lowly expressed genes and have our counts stored in a DGEList object, we can look at a few different plots to check that the data is good quality, and that the samples are as we would expect.

Library sizes and distribution plots

First, we can check how many reads we have for each sample in the dgeObj.

dgeObj$samples$lib.size

## [1] 50280843 15915232 21287077 20776498 23149145 14128030

We can also plot the library sizes as a barplot to see whether there are any major discrepanies between the samples more easily.

# The names argument tells the barplot to use the sample names on the x-axis
# The las argument rotates the axis names
barplot(dgeObj$samples$lib.size, names=colnames(dgeObj), las=2)
# Add a title to the plot
title("Barplot of library sizes")

Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. Next we’ll use box plots to check the distribution of the read counts on the log2 scale. We can use the cpm function to get log2 counts per million, which are corrected for the different library sizes. The cpm function also adds a small offset to avoid taking log of zero.

# Get log2 counts per million
logcounts <- cpm(dgeObj,log=TRUE)
# Check distributions of samples using boxplots
boxplot(logcounts, xlab="", ylab="Log2 counts per million",las=2)
# Let's add a blue horizontal line that corresponds to the median logCPM
abline(h=median(logcounts),col="blue")
title("Boxplots of logCPMs (unnormalised)")

From the boxplots we see that overall the density distributions of raw log-intensities are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further. And in this case, we doubt that sample OE1 may indicate outlier sample.

Multidimensional scaling plots

An MDSplot is a visualisation of a principle components analysis. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. We can use the plotMDS function to create the MDS plot.

# Redo the MDSplot with corrected information
col.cell <- c("purple","orange")[sampleinfo$status]
plotMDS(dgeObj,col=col.cell)
legend("topright",fill=c("purple","orange"),legend=levels(sampleinfo$status), pos)
title("Cell type")

Another alternative is to generate an interactive MDS plot using the Glimma package. This allows the user to interactively explore the different dimensions.

labels <- paste(sampleinfo$title, sampleinfo$status)
group <- sampleinfo$status
glMDSPlot(dgeObj, labels=labels, groups=group, folder="mds")

Glimma was created to make interactive versions of some of the popular plots from the limma package. At present it can be used to obtain MDS plots and mean-difference (MD) plots, which will be covered later. The output of glMDSPlot is an html page (/mds/MDS-Plot.html) that shows the MDS plot on the left, and the amount of variation explained by each dimension in a barplot on the right. The user can hover over points to find out sample information, and switch between successive dimensions in the MDS plot by clicking on the bars in the barplot. The default MDS plots shows dimensions 1 and 2.

Hierarchical clustering with heatmaps

An alternative approach to plotMDS for examining relationships between samples is using hierarchical clustering. Heatmaps are a nice visualisation to examine hierarchical clustering of your samples. We can do this using the heatmap.2 function from the gplots package. In this example heatmap.2 calculates a matrix of euclidean distances from the logCPM (logcounts object) for the 500 most variable genes. (Note this has more complicated code than plotting principle components using plotMDS.)

Let’s select data for the 500 most variable genes and plot the heatmap

# We estimate the variance for each row in the logcounts matrix
var_genes <- apply(logcounts, 1, var)
head(var_genes)

##        Rp1      Sox17     Mrpl15     Lypla1      Tcea1    Atp6v1h 
## 7.37494829 2.48666024 0.01090041 0.31523921 0.76800042 0.09455902

# Get the gene names for the top 500 most variable genes
select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
head(select_var)

## [1] "Ddx3y"   "Eif2s3y" "Uty"     "Kdm5d"   "Xist"    "Bmp10"

# Subset logcounts matrix
highly_variable_lcpm <- logcounts[select_var,]
dim(highly_variable_lcpm)

## [1] 500   6

head(highly_variable_lcpm)

##              OE1       OE2        OE3      GFP1      GFP2      GFP3
## Ddx3y   6.529260 -3.600278  6.6714104  6.747055 -3.600278  6.492581
## Eif2s3y 5.141785 -3.600278  6.5333585  6.744467 -3.600278  6.630892
## Uty     6.701250 -3.600278  4.7822478  5.063393 -3.600278  5.413526
## Kdm5d   6.766745 -3.600278  4.3388182  4.105778 -3.600278  4.379993
## Xist    8.477519  7.321857 -0.1826881 -1.096950  6.387398 -0.474964
## Bmp10   4.715637  5.396353 -3.6002782  2.044563  5.038692 -3.600278

## Get some nicer colours
mypalette <- brewer.pal(11,"RdYlBu")
morecols <- colorRampPalette(mypalette)
# Set up colour vector for celltype variable
col.cell <- c("purple","orange")[sampleinfo$status]

# Plot the heatmap
heatmap.2(highly_variable_lcpm, 
          col=rev(morecols(50)),
          trace="column", 
          main="Top 500 most variable genes across samples",
          ColSideColors=col.cell,scale="row")

# Save the heatmap
png(file="High_var_genes.heatmap.png")
heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes\nacross samples",ColSideColors=col.cell,scale="row")
dev.off()

## png 
##   2

Differentially Expressed Genes

Now let’s move to differentially expressed genes analysis

samples = sampleinfo[, c(1,3)]
row.names(samples) = sampleinfo$title
expr_srf = seqdata[keep,2:7]
rownames(expr_srf) = seqdata[keep,1]
design <- as.formula(~status)
design <- model.matrix(design, data = samples)
ddsObj <- DESeqDataSetFromMatrix(countData = expr_srf,
                                 colData = samples,
                                 design = design)
vstcounts <- vst(ddsObj, blind=TRUE)
#plotPCA(vstcounts, intgroup="status")
ddsObj <- estimateSizeFactors(ddsObj)
ddsObj <- estimateDispersions(ddsObj)

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

ddsObj <- nbinomWaldTest(object = ddsObj,modelMatrix = design)
res <- results(ddsObj, alpha=0.05)
head(res)

## log2 fold change (MLE): statusOE 
## Wald test p-value: statusOE 
## DataFrame with 6 rows and 6 columns
##                 baseMean     log2FoldChange             lfcSE
##                <numeric>          <numeric>         <numeric>
## Rp1     284.427371843602    2.6549425449046  1.31661945144577
## Sox17   211.513148148464    0.9706681146393 0.714084763483921
## Mrpl15  2223.39610545563 -0.791015111594018 0.585443552013981
## Lypla1  2716.32563223496 -0.845126773895334 0.771272650615312
## Tcea1   418.813850191803  0.326381443546781 0.388814545068391
## Atp6v1h  1219.9405154697 -0.274435044471403  0.69353394773004
##                       stat            pvalue              padj
##                  <numeric>         <numeric>         <numeric>
## Rp1       2.01648436986802                NA                NA
## Sox17      1.3593177788916 0.174045912645067 0.608497331931411
## Mrpl15   -1.35113813940362 0.176651186109257 0.609108424977331
## Lypla1   -1.09575618067243  0.27318548947567 0.668674721365126
## Tcea1    0.839427042240334 0.401229714145713  0.70617140501417
## Atp6v1h -0.395705279272396 0.692322460451111 0.834780863706197

topGenesPvV <- as.data.frame(res) %>%
  rownames_to_column("GeneID") %>% 
  arrange(padj) %>% 
  head(1000)
topGenesPvV[which(topGenesPvV$padj<=0.05),1]

## [1] "Col27a1" "Col7a1"  "Cd74"    "Cnr1"

Normalisation for composition bias

Usually we will perform the trimmed mean of M-values normalization method (TMM) to eliminate composition biases between libraries. Due to the time limit, please refer to other resources.

save(group,dgeObj,sampleinfo,file="~/Desktop/R_projects/data/preprocessing.Rdata")