Objective

Previously, we had aligned a set of RNA-seq reads to the Mycobacterium smegmatis MC2 155 genome assembly and performed a basic analysis, resulting in a table of counts. Now we will perform an analysis on this table of counts to detect differentially expressed genes. To do this, we will use the DESeq2 package from Bioconductor in R.

Step 0: Required Libraries

DESeq2 is a package from Bioconductor, an open source set of software tools in R for bioinformatics. To install:

# If you have not installed Bioconductor, do so here!

# source("https://bioconductor.org/biocLite.R")
# biocLite()

# Once this is completed:
biocLite("DESeq2")
# You may then use library(DESeq2) for subsequent loads.

To see the vignette, or training materials, associated with the DESeq2 package (highly recommended) use the following command:

vignette("DESeq2")

Step 1: Preparing Data

At this point, we should have three seperate datasets: SRR647673_htseq.txt, SRR647674_htseq.txt, and SRR647675_htseq.txt. Load these into your working environment in R, then examine one of the files with the head() function.

# Note: You will need to change the relative path to wherever you have downloaded these files.

SRR647673 <- read.table("SRR647673_htseq.txt",head=FALSE)
SRR647674 <- read.table("SRR647674_htseq.txt",head=FALSE)
SRR647675 <- read.table("SRR647675_htseq.txt",head=FALSE)
head(SRR647673)

##        V1   V2
## 1  gene:1 2046
## 2 gene:10  153
## 3 gene:11   42
## 4 gene:12   78
## 5 gene:13   19
## 6 gene:14   67

As can be seen, R has automatically assigned a name to the two columns: V1, the identified gene, and V2, the count of reads that have been assigned to that read. Of importance is the fact that the count data here has not been normalized. Why? For DESeq2’s statistical model to hold, raw counts must be used as these allow for assessing the measurement precision correctly. It is important to never provide counts that were pre-normalized for sequencing depth/library size to DESeq2.

We now need to merge these datasets to obtain a matrix of counts. Merges (or, more abstractly, joins) can be extraordinarily complicated tasks in different computational environment - R is no exception. To begin, it is always good practice to ensure that you have a common column (a “key”) that is found across all datasets. Here, it is V1 (which we will rename “Gene” in the matrix). To ensure that these are indeed the same, we can use a logic statement. We will check to see if all elements of the V1 column are the same across all three datasets.

all(SRR647673[,1]==SRR647674[,1]) && all(SRR647673[,1]==SRR647675[,1]) && all(SRR647674[,1]==SRR647675[,1])

## [1] TRUE

Great! We can perform a merge. There are (literally) a million ways to perform a merge in R. As we know that our key column is identical across the three datasets, our circumstances are trivial and can actually be performed easily via creating a new data.frame object.

countData<-data.frame(row.names =SRR647673[,1],SRR647673=SRR647673[,2],SRR647674=SRR647674[,2],SRR647675=SRR647675[,2])
head(countData)

##         SRR647673 SRR647674 SRR647675
## gene:1       2046      5563        17
## gene:10       153     84585        24
## gene:11        42     23166       340
## gene:12        78     37991        34
## gene:13        19      9037         0
## gene:14        67     34953        66

tail(countData)

##         SRR647673 SRR647674 SRR647675
## gene:76       229       525       139
## gene:77       329       436        79
## gene:78      1753      3023       130
## gene:8         90     18852       149
## gene:80         0         0         0
## gene:9        318     77199        69

## Note that all samples contain a 0 for "gene:80", so remove it
countData<-countData[-77,]
tail(countData)

##         SRR647673 SRR647674 SRR647675
## gene:75       146       317       164
## gene:76       229       525       139
## gene:77       329       436        79
## gene:78      1753      3023       130
## gene:8         90     18852       149
## gene:9        318     77199        69

The class used by the DESeq2 package to store the read counts is a DESeqDataSet object, which extends the Ranged-SummarizedExperiment class of the SummarizedExperiment package. This facilitates preparation steps and also downstream exploration of results. It is highly recommended to be come familiar with these object types (use vignette(SummarizedExperiment)), as they are used widely throughout the Bioconductor landscape in R.

In practical terms, this means we need to create a dataset of meta data, and then merge both our countData and metaData into one DESeqDataSet object.

(metaData<-data.frame(row.names = colnames(countData), condition = c("30MinPI","2point5HoursPI","Lysogen")))

##                condition
## SRR647673        30MinPI
## SRR647674 2point5HoursPI
## SRR647675        Lysogen

We can now create our DESeqDataSet object:

deseq.dat<- DESeqDataSetFromMatrix(countData = countData,
                                   colData = metaData,
                                   design = ~ condition)
deseq.dat

## class: DESeqDataSet 
## dim: 77 3 
## metadata(1): version
## assays(1): counts
## rownames(77): gene:1 gene:10 ... gene:8 gene:9
## rowData names(0):
## colnames(3): SRR647673 SRR647674 SRR647675
## colData names(1): condition

head(assay(deseq.dat))

##         SRR647673 SRR647674 SRR647675
## gene:1       2046      5563        17
## gene:10       153     84585        24
## gene:11        42     23166       340
## gene:12        78     37991        34
## gene:13        19      9037         0
## gene:14        67     34953        66

As with R: there’s more than one way to do things! If you’d like to short-cut the above manual steps, you can utilize the following commands below. Note that these commands will only work if your working directory is where the .txt files are located.

sampleFiles<-grep("SRR",list.files(getwd()),value=TRUE)
sampleCondition<-c("30MinPI","2point5HoursPI","Lysogen")
sampleTable<-data.frame(sampleName=gsub("_htseq.txt","",sampleFiles),
                        fileName = sampleFiles,
                        condition=sampleCondition)
deseq.dat2<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                     directory = getwd(),
                                     design= ~ condition)
deseq.dat2

## class: DESeqDataSet 
## dim: 78 3 
## metadata(1): version
## assays(1): counts
## rownames(78): gene:1 gene:10 ... gene:80 gene:9
## rowData names(0):
## colnames(3): SRR647673 SRR647674 SRR647675
## colData names(1): condition

head(assay(deseq.dat2))

##         SRR647673 SRR647674 SRR647675
## gene:1       2046      5563        17
## gene:10       153     84585        24
## gene:11        42     23166       340
## gene:12        78     37991        34
## gene:13        19      9037         0
## gene:14        67     34953        66

Step 2: Using DESeq2

The standard differential expression analysis steps are wrapped into a single function, DESeq. Results tables are generated using the function results, which extracts a results table with \(\log_2\) fold changes, p values and adjusted p values. With no arguments to results, the results will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the first level. Details about the comparison are printed to the console. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold change log2(treated/untreated).

dds <- DESeq(deseq.dat,quiet=TRUE)

## Warning in checkForExperimentalReplicates(object, modelMatrix): same number of samples and coefficients to fit,
##   estimating dispersion by treating samples as replicates.
##   read the ?DESeq section on 'Experiments without replicates'

## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.

Uh oh! You’ll notice from the above that DESeq2 has thrown an error. This isn’t a programmatic error, but a more fundamental one related to experimental design. DESeq2 is configured to work explicitly with experiments that have replication, Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis. If an experimental design is supplied which does not contain the necessary degrees of freedom for differential analysis, DESeq2 will angrily provide you the above error but continue it’s calculations. However, all of the samples will now be considered as replicates of a single group for the estimation of dispersion. The ??DESeq2 function will tell you: “Some overestimation of the variance may be expected, which will make that approach conservative.” This is a bit of an overstatement, as shown when you examine your pairwise data.

head(results(dds))

## log2 fold change (MAP): condition Lysogen vs 2point5HoursPI 
## Wald test p-value: condition Lysogen vs 2point5HoursPI 
## DataFrame with 6 rows and 6 columns
##          baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##         <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## gene:1   884.0707     -2.3093350  1.855291 -1.244729 0.2132313 0.4906545
## gene:10 5568.1980     -4.5829000  2.465534 -1.858786 0.0630575 0.4097192
## gene:11 2620.0824     -0.3520614  2.210830 -0.159244 0.8734766 0.9601247
## gene:12 2579.3242     -3.6052714  2.284374 -1.578231 0.1145125 0.4640768
## gene:13  587.0899     -3.5018195  2.741749 -1.277221 0.2015242 0.4906545
## gene:14 2485.9970     -2.7782088  2.245812 -1.237062 0.2160641 0.4906545

summary(results(dds))

## 
## out of 77 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 0, 0% 
## LFC < 0 (down)   : 0, 0% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 0, 0% 
## (mean count < 39)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

As you can see - we have found no differentially expressed genes due to our lack of statistical pwoer. For an interesting history on this topic, a look at this aging thread might be of value.

There is, however, a workaround for our particular case! Though we will not be able to assign p-values to any of our differentially expressed genes, we still have the capability of performing \(\log_2\) transformations to check for absolute (and not statistical) differences in expression.

rld <- rlogTransformation(dds)

## Warning in sparseTest(counts(object, normalized = TRUE), 0.9, 100, 0.1): the rlog assumes that data is close to a negative binomial distribution, an assumption
## which is sometimes not compatible with datasets where many genes have many zero counts
## despite a few very large counts.
## In this data, for 24.7% of genes with a sum of normalized counts above 100, it was the case 
## that a single sample's normalized count made up more than 90% of the sum over all samples.
## the threshold for this warning is 10% of genes. See plotSparsity(dds) for a visualization of this.
## We recommend instead using the varianceStabilizingTransformation or shifted log (see vignette).

## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.

res <- data.frame(
   assay(rld))
res$avgLogExpr <- ( res[,1]+res[,2]+res[,3] ) / 3
res$LogFC.73.74 <- res$SRR647673 - res$SRR647674
res$LogFC.73.75 <- res$SRR647673 - res$SRR647675
res$LogFC.74.75 <- res$SRR647674 - res$SRR647675 
head(res)

##         SRR647673 SRR647674 SRR647675 avgLogExpr LogFC.73.74 LogFC.73.75
## gene:1  10.258202  9.937125  7.817793   9.337707   0.3210776   2.4404091
## gene:10  7.949711 13.032059  8.587195   9.856322  -5.0823483  -0.6374833
## gene:11  6.474716 11.676800 11.330747   9.827421  -5.2020838  -4.8560311
## gene:12  6.709448 12.166878  8.567606   9.147977  -5.4574298  -1.8581581
## gene:13  4.084918 10.061743  2.728522   5.625061  -5.9768253   1.3563966
## gene:14  6.640638 12.104616  9.349334   9.364863  -5.4639789  -2.7086966
##         LogFC.74.75
## gene:1    2.1193315
## gene:10   4.4448649
## gene:11   0.3460527
## gene:12   3.5992717
## gene:13   7.3332219
## gene:14   2.7552823

It’s important to understand what the rlogTransformation function is doing. This function transforms the count data to the \(\log_2\) scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. It’s called the regularized \(\log_2\) transformation, and it’s a method for variance stabilization. This paper offers some pretty good clarification for why this is important.

Here’s a good example of why the rlog transformation is appropriate:

par( mfrow = c( 2, 3 ) )
ddplot <- estimateSizeFactors(dds)
plot(log2(counts(dds, normalized=TRUE)[,1:2] + 1),
     pch=16, cex=0.3) +
plot(log2(counts(dds, normalized=TRUE)[,c(1,3)] + 1),
     pch=16, cex=0.3, main = "Normal Log2") + 
plot(log2(counts(dds, normalized=TRUE)[,2:3] + 1),
     pch=16, cex=0.3) +
plot(assay(rld)[,1:2],
     pch=16, cex=0.3) +
plot(assay(rld)[,c(1,3)],
     pch=16, cex=0.3, main="Regularized Log2") +
plot(assay(rld)[,2:3],
     pch=16, cex=0.3)

We can see how genes with low counts (bottom left-hand corner of the first two plots on the top row) seem to be excessively variable on the ordinary logarithmic scale, while the rlog transform compresses differences for the low count genes for which the data provide little information about differential expression.

Step 3: Exploratory Data Analysis

Exploratory data analysis (EDA) is an incredibly importat part of bioinformatics (and data analysis in general). There are many great resources on this topic available (like this, this, and this). Below are some of the more standardized tasks for differential analysis.

Heat Map

A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment’s design? A good way to assess this is to find the Euclidian (or sometimes Poisson) distance and plot it with a heatmap.

To do this, we’ll need a few extra libraries:

library("pheatmap")
library("RColorBrewer")

We’ll use the dist function to get the Euclidian distance on a transposed array.

sampleDists <- dist( t( assay(rld) ) )
sampleDists

##           SRR647673 SRR647674
## SRR647674  30.44802          
## SRR647675  22.36172  25.18924

Then we simply convert to a matrix and plot.

sampleDistMatrix <- as.matrix( sampleDists )
# This is just to make the heatmap more attractive
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors,
         main = "A Heatmap of Similarities Between Groups")

MA Plot

The MA plot allows you to look at the relationship between intensity and difference between two different sets of data. More specifically, an MA-plot is a plot of log-intensity ratios (M-values) versus log-intensity averages (A-values).

plotMA(results(dds),main = "MA Plot of dds",ylim=c(-4,4))

The default behavior of the plotMA function would be to highlight differentially expressed genes (as determined by the adjusted p-value). We can again confirm that we have none via the following method:

table(results(dds)[,"padj"]<0.1)

## 
## FALSE 
##    77

To see the standard (or expected) output of the plotMA function, we can use the airway dataset.

library("airway")
data("airway")
se <- airway
dat<-DESeqDataSet(se, design = ~ cell + dex)
ex<-DESeq(dat)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

exrl<-rlog(ex)
rs<-results(ex)
plotMA(rs,main="MA Plot of Airway Data",ylim=c(-4,4))

Perhaps you want to highlight and label a random point on your plot?

plotMA(rs, main="MA Plot of Airway Data",ylim=c(-5,5))
topGene <- rownames(rs)[which.min(rs$padj)]
with(rs[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

PCA Plot

Another way to visualize sample-to-sample distances is a principal components analysis (PCA). Principal component analysis is a powerful tool from mathematical statistics: it underpins a lot of your daily experinece with predictive algorithms (like Netflix). A really good explanation of it is from this question on Stack Exchange. Basically: data points are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences. The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1. The y-axis is a direction that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not add to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).

require(ggplot2,ggthemes)
plotPCA(rld) + ggtitle("PCA Plot of rLog Data") + theme_few()

With our airway dataset:

plotPCA(exrl,intgroup=c("dex","cell"))+ ggtitle("PCA Plot of Airway Data") + theme_few()

Step 4: Annotation

An important step taken towards the end of an analysis pipeline is annotating your results. What we’ll go through now is retrieving an annotation file and merging it with our table to produce an annotated list of genes. We’ll then write that file out to our working directory.

First, download the annotation file and save it in your working directory. You can find that file here. Now, read that file into R.

annotation <- read.table(file="giles_annotation.txt",sep="\t",head=T,stringsAsFactors=FALSE)
head(annotation)

##    Symbol Start  End Strand Product            Note      ProteinID
## 1 Giles_1    14  571      +     gp1                 YP_001552330.1
## 2 Giles_2   630  800      -     gp2                 YP_001552331.1
## 3 Giles_3   930 1553      -     gp3                 YP_001552332.1
## 4 Giles_4  1617 1826      -     gp4                 YP_001552333.1
## 5 Giles_5  1855 3489      +     gp5       terminase YP_001552334.1
## 6 Giles_6  3506 4915      +     gp6 probably portal YP_001552335.1
##   NCBIGeneID
## 1    5758937
## 2    5758932
## 3    5758939
## 4    5758936
## 5    5758934
## 6    5758935

As you can see, the annotation file contains a lot of useful information. What we need to do is now combine it with the res file that we created in a previous step. Let’s examine that file again.

head(res)

##         SRR647673 SRR647674 SRR647675 avgLogExpr LogFC.73.74 LogFC.73.75
## gene:1  10.258202  9.937125  7.817793   9.337707   0.3210776   2.4404091
## gene:10  7.949711 13.032059  8.587195   9.856322  -5.0823483  -0.6374833
## gene:11  6.474716 11.676800 11.330747   9.827421  -5.2020838  -4.8560311
## gene:12  6.709448 12.166878  8.567606   9.147977  -5.4574298  -1.8581581
## gene:13  4.084918 10.061743  2.728522   5.625061  -5.9768253   1.3563966
## gene:14  6.640638 12.104616  9.349334   9.364863  -5.4639789  -2.7086966
##         LogFC.74.75
## gene:1    2.1193315
## gene:10   4.4448649
## gene:11   0.3460527
## gene:12   3.5992717
## gene:13   7.3332219
## gene:14   2.7552823

What we need to do is find a “key” to merge our datasets by. We do not have a unified key that stands out, so we’ll need to make one. Did you notice something interesting about the rownames of res and the first column of annotation?

rownames(res)

##  [1] "gene:1"  "gene:10" "gene:11" "gene:12" "gene:13" "gene:14" "gene:15"
##  [8] "gene:16" "gene:17" "gene:18" "gene:19" "gene:2"  "gene:20" "gene:21"
## [15] "gene:22" "gene:23" "gene:24" "gene:25" "gene:26" "gene:27" "gene:28"
## [22] "gene:29" "gene:3"  "gene:30" "gene:31" "gene:32" "gene:33" "gene:34"
## [29] "gene:35" "gene:36" "gene:37" "gene:38" "gene:39" "gene:4"  "gene:40"
## [36] "gene:41" "gene:42" "gene:43" "gene:44" "gene:45" "gene:46" "gene:47"
## [43] "gene:49" "gene:5"  "gene:50" "gene:51" "gene:52" "gene:53" "gene:54"
## [50] "gene:55" "gene:56" "gene:57" "gene:58" "gene:59" "gene:6"  "gene:60"
## [57] "gene:61" "gene:62" "gene:63" "gene:64" "gene:65" "gene:66" "gene:67"
## [64] "gene:68" "gene:69" "gene:7"  "gene:70" "gene:71" "gene:72" "gene:73"
## [71] "gene:74" "gene:75" "gene:76" "gene:77" "gene:78" "gene:8"  "gene:9"

annotation$Symbol

##  [1] "Giles_1"  "Giles_2"  "Giles_3"  "Giles_4"  "Giles_5"  "Giles_6" 
##  [7] "Giles_7"  "Giles_8"  "Giles_9"  "Giles_10" "Giles_11" "Giles_12"
## [13] "Giles_13" "Giles_14" "Giles_15" "Giles_16" "Giles_17" "Giles_18"
## [19] "Giles_19" "Giles_20" "Giles_21" "Giles_22" "Giles_23" "Giles_24"
## [25] "Giles_25" "Giles_26" "Giles_27" "Giles_28" "Giles_29" "Giles_30"
## [31] "Giles_31" "Giles_32" "Giles_33" "Giles_34" "Giles_35" "Giles_36"
## [37] "Giles_37" "Giles_38" "Giles_39" "Giles_40" "Giles_41" "Giles_42"
## [43] "Giles_43" "Giles_44" "Giles_45" "Giles_46" "Giles_47" "Giles_49"
## [49] "Giles_50" "Giles_51" "Giles_52" "Giles_53" "Giles_54" "Giles_55"
## [55] "Giles_56" "Giles_57" "Giles_58" "Giles_59" "Giles_60" "Giles_61"
## [61] "Giles_62" "Giles_63" "Giles_64" "Giles_65" "Giles_66" "Giles_67"
## [67] "Giles_68" "Giles_69" "Giles_70" "Giles_71" "Giles_72" "Giles_73"
## [73] "Giles_74" "Giles_75" "Giles_76" "Giles_77" "Giles_78"

In order to “build” a key, we’ll need to do some string manipulation. First, we’ll need to make a column vector in the res dataframe that contains the “gene:##” string, as leaving it as a rowname won’t allow us to perform our merge.

res$Gene<-rownames(res)
head(res)

##         SRR647673 SRR647674 SRR647675 avgLogExpr LogFC.73.74 LogFC.73.75
## gene:1  10.258202  9.937125  7.817793   9.337707   0.3210776   2.4404091
## gene:10  7.949711 13.032059  8.587195   9.856322  -5.0823483  -0.6374833
## gene:11  6.474716 11.676800 11.330747   9.827421  -5.2020838  -4.8560311
## gene:12  6.709448 12.166878  8.567606   9.147977  -5.4574298  -1.8581581
## gene:13  4.084918 10.061743  2.728522   5.625061  -5.9768253   1.3563966
## gene:14  6.640638 12.104616  9.349334   9.364863  -5.4639789  -2.7086966
##         LogFC.74.75    Gene
## gene:1    2.1193315  gene:1
## gene:10   4.4448649 gene:10
## gene:11   0.3460527 gene:11
## gene:12   3.5992717 gene:12
## gene:13   7.3332219 gene:13
## gene:14   2.7552823 gene:14

We’ll perform our string manipulation on the annotation data. What we need to do is:

Seperate the string that we want (“Symbol”) into useful components
Rename one part of that component so that it matches the correct element of the “Gene” vector in res.

## Look at the annotation Gene Symbols
annotation$Symbol

##  [1] "Giles_1"  "Giles_2"  "Giles_3"  "Giles_4"  "Giles_5"  "Giles_6" 
##  [7] "Giles_7"  "Giles_8"  "Giles_9"  "Giles_10" "Giles_11" "Giles_12"
## [13] "Giles_13" "Giles_14" "Giles_15" "Giles_16" "Giles_17" "Giles_18"
## [19] "Giles_19" "Giles_20" "Giles_21" "Giles_22" "Giles_23" "Giles_24"
## [25] "Giles_25" "Giles_26" "Giles_27" "Giles_28" "Giles_29" "Giles_30"
## [31] "Giles_31" "Giles_32" "Giles_33" "Giles_34" "Giles_35" "Giles_36"
## [37] "Giles_37" "Giles_38" "Giles_39" "Giles_40" "Giles_41" "Giles_42"
## [43] "Giles_43" "Giles_44" "Giles_45" "Giles_46" "Giles_47" "Giles_49"
## [49] "Giles_50" "Giles_51" "Giles_52" "Giles_53" "Giles_54" "Giles_55"
## [55] "Giles_56" "Giles_57" "Giles_58" "Giles_59" "Giles_60" "Giles_61"
## [61] "Giles_62" "Giles_63" "Giles_64" "Giles_65" "Giles_66" "Giles_67"
## [67] "Giles_68" "Giles_69" "Giles_70" "Giles_71" "Giles_72" "Giles_73"
## [73] "Giles_74" "Giles_75" "Giles_76" "Giles_77" "Giles_78"

## Split the string at the "_" character to isolate the number
split <- strsplit(as.character(annotation$Symbol),'_') 
## Turn this list of split strings into a dataframe
split<- data.frame(do.call(rbind, split))
## Append "gene:" to the front of the second column
split[,2] <- paste("gene:",split[,2],sep="")
## Append the second column back to the annotation data frame
annotation$Key <- split[,2]
head(annotation)

##    Symbol Start  End Strand Product            Note      ProteinID
## 1 Giles_1    14  571      +     gp1                 YP_001552330.1
## 2 Giles_2   630  800      -     gp2                 YP_001552331.1
## 3 Giles_3   930 1553      -     gp3                 YP_001552332.1
## 4 Giles_4  1617 1826      -     gp4                 YP_001552333.1
## 5 Giles_5  1855 3489      +     gp5       terminase YP_001552334.1
## 6 Giles_6  3506 4915      +     gp6 probably portal YP_001552335.1
##   NCBIGeneID    Key
## 1    5758937 gene:1
## 2    5758932 gene:2
## 3    5758939 gene:3
## 4    5758936 gene:4
## 5    5758934 gene:5
## 6    5758935 gene:6

We now have our key!

Important Note: The only reason we could “append” the above column is that we knew that the columns were already ordered and we had done nothing to change that order. In many situations, you may need to “merge” the data in the fashion that we will do so below.

Now that we have our key, we can perform a merge.

anno.df<-merge(res,annotation,by.x="Gene",by.y="Key",all=T)
head(anno.df)

##      Gene SRR647673 SRR647674 SRR647675 avgLogExpr LogFC.73.74 LogFC.73.75
## 1  gene:1 10.258202  9.937125  7.817793   9.337707   0.3210776   2.4404091
## 2 gene:10  7.949711 13.032059  8.587195   9.856322  -5.0823483  -0.6374833
## 3 gene:11  6.474716 11.676800 11.330747   9.827421  -5.2020838  -4.8560311
## 4 gene:12  6.709448 12.166878  8.567606   9.147977  -5.4574298  -1.8581581
## 5 gene:13  4.084918 10.061743  2.728522   5.625061  -5.9768253   1.3563966
## 6 gene:14  6.640638 12.104616  9.349334   9.364863  -5.4639789  -2.7086966
##   LogFC.74.75   Symbol Start  End Strand Product           Note
## 1   2.1193315  Giles_1    14  571      +     gp1               
## 2   4.4448649 Giles_10  7539 7979      +    gp10 virion protein
## 3   0.3460527 Giles_11  7983 8435      +    gp11               
## 4   3.5992717 Giles_12  8435 9118      +    gp12 virion protein
## 5   7.3332219 Giles_13  9121 9528      +    gp13               
## 6   2.7552823 Giles_14  9521 9922      +    gp14               
##        ProteinID NCBIGeneID
## 1 YP_001552330.1    5758937
## 2 YP_001552339.1    5758944
## 3 YP_001552340.1    5758941
## 4 YP_001552341.1    5758950
## 5 YP_001552342.1    5758956
## 6 YP_001552343.1    5758940

Voila! You now have a merged data.frame with annotation information. You can write this information out to your working directory with the following command:

write.csv(anno.df,file="AnnotatedGeneList.csv",row.names = FALSE)

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2         pheatmap_1.0.8            
##  [3] DESeq2_1.12.4              SummarizedExperiment_1.2.3
##  [5] Biobase_2.32.0             GenomicRanges_1.24.3      
##  [7] GenomeInfoDb_1.8.7         IRanges_2.6.1             
##  [9] S4Vectors_0.10.3           BiocGenerics_0.18.0       
## [11] ggthemes_3.2.0             ggplot2_2.2.0             
## 
## loaded via a namespace (and not attached):
##  [1] genefilter_1.54.2    locfit_1.5-9.1       splines_3.3.1       
##  [4] lattice_0.20-34      colorspace_1.2-7     htmltools_0.3.5     
##  [7] yaml_2.1.13          chron_2.3-47         survival_2.40-1     
## [10] XML_3.98-1.4         foreign_0.8-67       DBI_0.5-1           
## [13] BiocParallel_1.6.6   plyr_1.8.4           stringr_1.1.0       
## [16] zlibbioc_1.18.0      munsell_0.4.3        gtable_0.2.0        
## [19] evaluate_0.10        labeling_0.3         latticeExtra_0.6-28 
## [22] knitr_1.14           geneplotter_1.50.0   AnnotationDbi_1.34.4
## [25] Rcpp_0.12.7          acepack_1.4.1        xtable_1.8-2        
## [28] scales_0.4.1         formatR_1.4          Hmisc_3.17-4        
## [31] annotate_1.50.1      XVector_0.12.1       gridExtra_2.2.1     
## [34] digest_0.6.10        stringi_1.1.2        grid_3.3.1          
## [37] tools_3.3.1          bitops_1.0-6         magrittr_1.5        
## [40] lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2          
## [43] RSQLite_1.0.0        Formula_1.2-1        cluster_2.0.5       
## [46] Matrix_1.2-7.1       data.table_1.9.6     assertthat_0.1      
## [49] rmarkdown_1.1        rpart_4.1-10         nnet_7.3-12

RNA-seq Differential Analysis with R

Benjamin King, PhD | bking@mdibl.org | Kyle Shank | kshank@mdibl.org |