INTRODUCTION TO TRANSCRIPTOMICS

Transcriptomes are cool molecular phenotypes. They have many useful features for studying evolutionary change.

They also pose challenges:

INTRODUCTION TO THE DATASET

We will be studying a primate liver transcriptome RNA-sequencing dataset (Blekhman et al. 2010, Genome Research).

It is available at NCBI GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17274. We will download the raw read counts from the supplementary files at the bottom of that page.
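
If you like, you can also fetch the file directly from R; a minimal sketch, assuming the standard GEO supplementary file URL pattern (check the GEO page for the exact file name and link):

# download the count table (URL pattern assumed, not verified here)
urlx = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE17nnn/GSE17274/suppl/GSE17274_ReadCountPerLane.txt.gz"
download.file( urlx, destfile = "GSE17274_ReadCountPerLane.txt.gz" )
# read.table can read gzipped files directly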

The dataset contains:

It was prepared as follows, from postmortem liver data:

The three species have a phylogeny that looks like this:

# a small simulation that inserts mutations proportional to genomic differences
human = chimp = macaque = rep(0, 10000)  # the ancestral genome
# now add mutations, by replacing 0's by 1's
hominid = sample(1:10000, 300)  # mutations shared by the human-chimp ancestor
human[c(hominid, sample(1:10000, 50) )] = 1
chimp[c(hominid, sample(1:10000, 50) )] = 1
macaque[sample(1:10000, 350) ] = 1
# calculate euclidean distances 
distx = dist(rbind(human, chimp, macaque), method = "euclidean")
plot( hclust( distx), hang=-1, xlab = "", sub = "", main = "phylogeny")

What questions could we try to answer with such data?

  • Does transcriptome divergence reflect common environments?

  • Or does it reflect common descent (phylogeny)?

  • Are transcriptome differences mainly evolving neutrally?

  • Which genes will be differentially expressed (DE) among the species?

  • Will there be a sex effect? Will it be consistent among species?

  • Which genes are differentially expressed among females & males?

  • Do DE gene sets correspond to specific functions?

These are the questions we will tackle during the course.

The preprocessing steps applied in the published study:

  1. obtain 6-7 million reads per sample

  2. use exons from Ensembl

  3. identify 1:1 orthologous exons among species

  4. map reads to this orthologous exon set

  5. keep only unambiguously mapped reads

  6. sum number of reads mapped to all exons of a gene

  7. divide by the total number of reads mapped to genes in that lane (× 1 million), to obtain relative gene expression levels

  8. transform & analyze

We start the analysis here from step (6).
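
For reference, step (7) is a counts-per-million (CPM)-style scaling. A minimal sketch of the idea, using a toy count matrix (genes in rows, lanes in columns; the numbers are made up):

# toy example of step (7): scale each lane by its total, times 1 million
counts = matrix( c(10, 20, 70, 5, 15, 80), nrow = 3 )
t( t( counts ) / colSums( counts ) ) * 1e6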

What we will implement:

  1. Pre-processing of the raw transcriptome matrix (expression level estimates), including summarizing across replicates, log-transformation, normalization across rows, filtering.

  2. Differential expression estimation using ANOVA, multiple testing correction, and some functional analysis examples.

What we will not implement:

  1. We will not perform the early processing steps: checking quality of the .fastq files, mapping to the respective genomes or transcriptomes, estimating expression levels per gene.

An important point here is the creation of masked interspecies gtf files whenever you study multiple species. You can create such files if you have genomes for the species you are comparing. You should not use each species’ gtf file as is, because biases in gene annotation among species are widespread, and can easily confound your analyses. E.g. model organisms’ genes will tend to be annotated as longer than those of non-model organisms, and will lead to lower expression level estimates for the better-annotated species, unless corrected.

Unfortunately, to my knowledge, there are no standard packages or approaches for gtf masking. One approach is to crop each species’ gtf file to only include homologous parts. For this, you can align homologous CDS using MUSCLE or similar alignment program. Alternatively, you can run reciprocal BLAST for exons, and retain those that map uniquely (e.g. He et al., 2017, Nature Neuroscience).

If you have only one genome to use as reference, this will also introduce biases if the reference is not an outgroup.

  2. We will not cover the current standard procedure for identifying differential expression: many researchers use DESeq2, edgeR or similar packages (fitting the raw counts to a negative binomial distribution), instead of ANOVA (assuming a normal distribution). Indeed, the former are superior, especially when sample sizes are small. However, here we also aim to (1) learn R and (2) understand the logic behind different statistical approaches, instead of providing ready recipes.

Extra notes:

  • Using masked gtf files means you cannot use fast pseudo-alignment methods for mapping; instead, using STAR and similar alignment tools would be the most reasonable approach.

  • In general, using only reads that map uniquely to the genome also makes sense to avoid biases arising due to lineage-specific duplications. However, remember that you are losing data when ignoring such reads.

  • The scaling / normalization we will apply does not include normalization by gene length, which is part of the RPKM / FPKM / TPM statistics. Some approaches also correct for GC bias. However, these are not a major problem for us, as we will not be comparing expression level estimates among genes, only among individuals. Plus, we are using homologous exon parts across species.

  • Some researchers argue for choosing flexible models (like the negative binomial) to fit count data, instead of log-transforming the data and using ANOVA. This is generally reasonable, although for many, ANOVA is more familiar and intuitive.

  • There are approaches such as surrogate variable analysis (Leek & Storey, 2007, Plos Genetics) that can help remove confounding factors in the dataset. But these are controversial, as they can also lead to overfitting (and thus, non-replicable results). Instead, one can also explicitly add technical variables (e.g. RNA integrity number (RIN) from Agilent Bioanalyzer, post-mortem interval, experimental batch, etc.) to the model.
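
To illustrate that last option, a hypothetical sketch (rin and batch are assumed covariate vectors, one value per sample; they do not exist in our dataset):

# a hypothetical model with technical covariates added to species and sex
# summary( aov( y ~ species + sex + rin + batch ) )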

READ IN THE DATA AND OVERVIEW

# set the directory to where the downloaded file is 
setwd("~/molphe_mehmet_somel")  

list.files()
##  [1] "20180316_berlin_compexprs.pptx"      
##  [2] "20180316_berlin_compexprs_code.Rmd"  
##  [3] "20180608_molphe_compexprs_code.html" 
##  [4] "20180608_molphe_compexprs_code.Rmd"  
##  [5] "20180608_molphe_compexprs_code.tex"  
##  [6] "20180608_molphe_compexprs_code_files"
##  [7] "GSE17274_ReadCountPerLane.txt"       
##  [8] "nfmat3_aovp.Rdata"                   
##  [9] "perm_nfmat3_aovp.RData"              
## [10] "perm_nfmat3_aovp2.RData"
mat = read.table( 
  "./GSE17274_ReadCountPerLane.txt", row.names=1, head=T, sep="\t" )
# head argument: first row contains column names
# row.names: first column contains row names 
# sep: use tab as the separator (the default, any whitespace, would also work here)

# now, take an overview of the data

dim(mat)  # check the dimensions - is it what you expect?
## [1] 20689    36
head(mat)[,1:6] # does the data look fine? samples in columns and genes in rows
tail(mat)[,1:6]
class(mat)
## [1] "data.frame"

Data frames are useful for storing complex data, but our matrix is just numeric. Also, some functions that work on numeric data do not accept data.frame input. We therefore prefer to convert to matrix:

mat = as.matrix(mat)
class(mat)
## [1] "matrix"
# what are the column names?
colnames(mat)
##  [1] "R1L1.HSM1" "R1L2.PTF1" "R1L3.RMM1" "R1L4.HSF1" "R1L6.PTM1"
##  [6] "R1L7.RMF1" "R2L2.RMF2" "R2L3.HSM2" "R2L4.PTF2" "R2L6.RMM2"
## [11] "R2L7.HSF2" "R2L8.PTM2" "R3L1.RMM3" "R3L2.HSF2" "R3L3.PTM1"
## [16] "R3L4.RMF3" "R3L6.HSM3" "R3L7.PTF3" "R3L8.RMM1" "R4L1.HSM3"
## [21] "R4L2.HSF1" "R4L3.RMM3" "R4L4.PTF1" "R4L6.PTM2" "R4L7.RMF3"
## [26] "R4L8.HSM2" "R5L1.RMF1" "R5L2.HSM1" "R5L3.PTF3" "R5L4.RMM2"
## [31] "R5L8.RMF2" "R6L2.PTM3" "R6L4.PTM3" "R6L6.PTF2" "R8L1.HSF3"
## [36] "R8L2.HSF3"
# and row names?
head( rownames(mat)  )
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"

Can we check if all the gene IDs are Ensembl, using grep?

grep("me", c("you", "me", "same"))
## [1] 2 3
grep("me", c("you", "me", "same"), val=T)
## [1] "me"   "same"
length( grep ("ENSG", rownames( mat) ) )
## [1] 20689
length( grep ("ENSG", rownames( mat) ) ) == nrow( mat ) 
## [1] TRUE
summary( mat[,1:4] )
##    R1L1.HSM1          R1L2.PTF1          R1L3.RMM1       
##  Min.   :    0.00   Min.   :     0.0   Min.   :     0.0  
##  1st Qu.:    0.00   1st Qu.:     0.0   1st Qu.:     0.0  
##  Median :    3.00   Median :     5.0   Median :     5.0  
##  Mean   :   65.08   Mean   :   128.9   Mean   :   128.4  
##  3rd Qu.:   30.00   3rd Qu.:    49.0   3rd Qu.:    48.0  
##  Max.   :91116.00   Max.   :235597.0   Max.   :304723.0  
##    R1L4.HSF1        
##  Min.   :     0.00  
##  1st Qu.:     0.00  
##  Median :     4.00  
##  Mean   :    80.53  
##  3rd Qu.:    35.00  
##  Max.   :276431.00

Let’s have a brief look at the data distribution.

# automatically draws boxplots for the columns
boxplot( mat, col=8 )

The data look very skewed, overall. This is one thing we’ll have to deal with.

To further obtain an overview, we can use data reduction (clustering, PCA, MDS, etc).

We’ll start with hierarchical clustering. This is a simple grouping algorithm, implemented in the R function hclust, which also offers UPGMA (average linkage; see the sketch below).

Now we wish to calculate euclidean distance among the samples (columns) in the expression matrix:

distmat = dist( t( mat ), method = "euclidean") 
dim( distmat )  # a "dist" object is not a matrix, so dim() returns NULL
## NULL
plot( hclust( distmat ) )

Can we detect any pattern on the tree? Was the experiment a waste?
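
As an aside, hclust’s "average" agglomeration method corresponds to UPGMA, and classical multidimensional scaling (cmdscale) offers another low-dimensional view of the same distance matrix. A minimal sketch:

# UPGMA tree: average linkage instead of the default complete linkage
plot( hclust( distmat, method = "average" ), main = "UPGMA" )
# classical MDS into 2 dimensions
mds = cmdscale( distmat, k = 2 )
plot( mds, xlab = "MDS 1", ylab = "MDS 2" )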

DEFINE CATEGORICAL FACTORS

To further study the data, we need to know which column corresponds to which species, sex, sequencing run.

All the information is encoded in the column names, e.g.:

R1L1.HSM1

This indicates, respectively: the sequencing run (R1), the lane (L1), the species (HS = Homo sapiens; PT = Pan troglodytes; RM = rhesus macaque), the sex (M/F), and the individual ID (1).

How to extract that info?

# use the substr() function 
x = colnames(mat)[1]
substr( x, 1, 2 )
## [1] "R1"
substr( x, 6, 7 )
## [1] "HS"
# now extract species IDs from column names, and create a new object called "species" 
# substr is vectorized, running along the whole vector
species = substr ( colnames(mat), 6, 7 ) 
# check that it worked
species
##  [1] "HS" "PT" "RM" "HS" "PT" "RM" "RM" "HS" "PT" "RM" "HS" "PT" "RM" "HS"
## [15] "PT" "RM" "HS" "PT" "RM" "HS" "HS" "RM" "PT" "PT" "RM" "HS" "RM" "HS"
## [29] "PT" "RM" "RM" "PT" "PT" "PT" "HS" "HS"
# frequencies
table( species ) 
## species
## HS PT RM 
## 12 12 12

Now do the same for sex and individuals:

sex = substr( colnames(mat), 8, 8)
table( sex ) 
## sex
##  F  M 
## 18 18
indv = substr( colnames(mat), 6, 9)
table( indv ) 
## indv
## HSF1 HSF2 HSF3 HSM1 HSM2 HSM3 PTF1 PTF2 PTF3 PTM1 PTM2 PTM3 RMF1 RMF2 RMF3 
##    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2 
## RMM1 RMM2 RMM3 
##    2    2    2

Draw and label trees using a loop (Optional exercise)

Now check whether, in the trees you draw, samples cluster according to species, individual, sex, or run.

Use a loop, and use the labels parameter of plot. But for that, you have to first create a list to run over:

myvariables = list( species, sex, indv )
names(myvariables) = c("species", "sex", "indv")
# define a 3X1 window
par(mfrow=c(3,1))
# and run your loop to fill the window
for( varx in myvariables) {
  plot( hclust( distmat ), labels = varx )
}

So again, no obvious clustering w.r.t. species; samples cluster only by individual ID, i.e. by technical replicate.

SUMMARIZE ACROSS TECHNICAL REPLICATES

Can we treat replicates as independent samples in downstream analyses? How would this affect DE analyses?

# demo of pseudoreplication
x = 1:5
y = 3:6
t.test(x, y) # not sig
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -1.5667, df = 6.9808, p-value = 0.1613
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.7652203  0.7652203
## sample estimates:
## mean of x mean of y 
##       3.0       4.5
t.test(rep(x, 2), rep(y, 2))  # sig
## 
##  Welch Two Sample t-test
## 
## data:  rep(x, 2) and rep(y, 2)
## t = -2.3694, df = 15.996, p-value = 0.03074
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.842104 -0.157896
## sample estimates:
## mean of x mean of y 
##       3.0       4.5

Treating replicates as independent observations is called pseudoreplication and needs to be avoided. How to remove the replicate effect?

  1. We can include an individual effect in the statistical model, as a “random effect” (superior),

  2. We can take the average per individual from the start (not so nice, but simple).

  3. We can remove one randomly (worst, because makes the data least reproducible).

Since we will not (for now) include individual in the model as a random effect, we can summarize by taking the mean per replicate pair:

z = indv[1]
head(mat[, indv == z] )
##                 R1L1.HSM1 R5L2.HSM1
## ENSG00000000003        60        61
## ENSG00000000005         0         0
## ENSG00000000419        17        22
## ENSG00000000457        50        64
## ENSG00000000460         9         6
## ENSG00000000938        32        41
z_expr = rowMeans( mat[, indv == z] )
length( z_expr )
## [1] 20689
# double check that it worked
head( cbind( mat[, indv == z], z_expr) )
##                 R1L1.HSM1 R5L2.HSM1 z_expr
## ENSG00000000003        60        61   60.5
## ENSG00000000005         0         0    0.0
## ENSG00000000419        17        22   19.5
## ENSG00000000457        50        64   57.0
## ENSG00000000460         9         6    7.5
## ENSG00000000938        32        41   36.5

Now let’s try to create a new matrix mat2 where replicates are summarized, using sapply:

mat2 = sapply( unique( indv ), function(x) { 
  rowMeans( mat[, indv == x] )
})
dim( mat2 )
## [1] 20689    18
head( mat2 )
##                 HSM1  PTF1  RMM1  HSF1  PTM1  RMF1  RMF2  HSM2  PTF2  RMM2
## ENSG00000000003 60.5 287.0 212.0 164.5 204.5 255.5 232.0 210.5 214.5 674.0
## ENSG00000000005  0.0   0.5   1.0   0.0   0.0   0.0   0.5   0.5   2.0   0.0
## ENSG00000000419 19.5  50.0  24.5  40.5  29.5  33.5  36.0  39.0  33.0  41.0
## ENSG00000000457 57.0  65.5  58.5  45.5 137.0  44.5  34.5  36.5 114.5  57.0
## ENSG00000000460  7.5   4.0   3.5   3.0   6.0   2.0   2.0   3.5   4.0   1.5
## ENSG00000000938 36.5  42.0  40.0  22.0  99.0  36.5  14.0  21.5  52.0  17.0
##                  HSF2  PTM2  RMM3  RMF3  HSM3  PTF3  PTM3 HSF3
## ENSG00000000003 150.0 342.5 352.5 240.0 179.5 193.0 214.0   84
## ENSG00000000005   0.0   0.5   0.0   1.0   0.0   0.5   0.5    0
## ENSG00000000419  30.5  29.0  45.0  28.5  31.0  32.5  46.5   28
## ENSG00000000457  31.0 119.5  81.5  59.0  47.5  43.5  74.5   38
## ENSG00000000460   8.5   3.5   2.0   5.0   8.5   2.5   5.0    6
## ENSG00000000938  32.5  15.5  36.0  27.0  32.5  38.0  29.0  105

We now need to add the column names back (in fact, sapply over a character vector already names the columns, as seen above; we set them explicitly anyway).

Note that here we assume sapply did not reorder the vector, which R indeed guarantees:

colnames( mat2 ) = unique( indv )
head( mat2 )
##                 HSM1  PTF1  RMM1  HSF1  PTM1  RMF1  RMF2  HSM2  PTF2  RMM2
## ENSG00000000003 60.5 287.0 212.0 164.5 204.5 255.5 232.0 210.5 214.5 674.0
## ENSG00000000005  0.0   0.5   1.0   0.0   0.0   0.0   0.5   0.5   2.0   0.0
## ENSG00000000419 19.5  50.0  24.5  40.5  29.5  33.5  36.0  39.0  33.0  41.0
## ENSG00000000457 57.0  65.5  58.5  45.5 137.0  44.5  34.5  36.5 114.5  57.0
## ENSG00000000460  7.5   4.0   3.5   3.0   6.0   2.0   2.0   3.5   4.0   1.5
## ENSG00000000938 36.5  42.0  40.0  22.0  99.0  36.5  14.0  21.5  52.0  17.0
##                  HSF2  PTM2  RMM3  RMF3  HSM3  PTF3  PTM3 HSF3
## ENSG00000000003 150.0 342.5 352.5 240.0 179.5 193.0 214.0   84
## ENSG00000000005   0.0   0.5   0.0   1.0   0.0   0.5   0.5    0
## ENSG00000000419  30.5  29.0  45.0  28.5  31.0  32.5  46.5   28
## ENSG00000000457  31.0 119.5  81.5  59.0  47.5  43.5  74.5   38
## ENSG00000000460   8.5   3.5   2.0   5.0   8.5   2.5   5.0    6
## ENSG00000000938  32.5  15.5  36.0  27.0  32.5  38.0  29.0  105

We also need to define factors again. Using new object names helps prevent confusion.

We may also convert the vectors to R class factor. This data type is useful for defining categories & running ANOVAs, plotting boxplots, etc.

species2 = factor( substr( colnames( mat2 ), 1, 2 ) )
sex2 = factor( substr( colnames( mat2 ), 3, 3 ) )
table(species2, sex2)
##         sex2
## species2 F M
##       HS 3 3
##       PT 3 3
##       RM 3 3

DATA TRANSFORMATION

How may data transformation help? To see the idea, check the histogram of the full data (or the first column if you like):

hist( mat2, col=5, xlab='counts' )

# convert into a vector
x = c( mat2 ) 
length(x)
## [1] 372402

Super right-skewed! One approach commonly applied to right-skewed data is to take the logarithm (or the square root, as the original authors did):

hist( log2(x), col=5, xlab='counts' )

head( log2(x) )
## [1] 5.918863     -Inf 4.285402 5.832890 2.906891 5.189825

Zeros are converted into -Inf. What to do? A common solution is to add a pseudocount of 1 before taking the log:

hist( log2(x + 1), col=5, xlab='counts' )

A nice “mosque plot”. What does the minaret represent? And the dome?

Transformation has two main uses: it reduces the skew (so that a handful of very highly expressed genes do not dominate the analysis), and it stabilizes the variance (making the variance independent of the mean; see the exercise below).

Caution: some researchers vehemently oppose transformation (or normalization), and instead prefer appropriate statistical methods that take into account the specific nature of the data (e.g. using the negative binomial distribution).

Stabilizing variance by log transformation (Optional exercise)

Let’s check the effect of variance stabilizing via log transformation on this dataset. We will:

  • compare variance vs mean of each gene using the raw data,

  • then repeat this using log-transformed data.

It might make sense to calculate variance using only one species, e.g. the human subset of the data. We can call it hmat2. Why only human? In this way, variance only represents technical/biological noise, not evolutionary divergence.

# create the subset
hmat2 = mat2[, species2=="HS"]
lhmat2 = log2(hmat2+1)
# this would also work
hmat2 = subset(mat2, select = species2=="HS")
dim(hmat2)
## [1] 20689     6
head(hmat2)
##                 HSM1  HSF1  HSM2  HSF2  HSM3 HSF3
## ENSG00000000003 60.5 164.5 210.5 150.0 179.5   84
## ENSG00000000005  0.0   0.0   0.5   0.0   0.0    0
## ENSG00000000419 19.5  40.5  39.0  30.5  31.0   28
## ENSG00000000457 57.0  45.5  36.5  31.0  47.5   38
## ENSG00000000460  7.5   3.0   3.5   8.5   8.5    6
## ENSG00000000938 36.5  22.0  21.5  32.5  32.5  105
rm = rowMeans(hmat2)
# this would also have worked
rm = apply(hmat2, 1, mean)  # means per gene
rv = apply(hmat2, 1, var)  # variance per gene

Now plot the relationship between mean and variance, using a scatter plot:

plot(rm, rv, col=6, ylab='variance', xlab='mean', main='raw')

# each point is a gene

Try plotting by converting both axes to log:

plot(rm, rv, log='xy', col=6, ylab='variance', xlab='mean', main='raw')
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4314 x values <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4314 y values <= 0 omitted
## from logarithmic plot
cor(rm, rv)
## [1] 0.906999
text( x = 1e1, y = 1e8, paste("rho =", round( cor(rm, rv), 2 ) ) )

Note that the data themselves are not log-transformed here; the statistics (mean and var) are merely drawn on log-scaled axes.

Now repeat the same analysis but using the log-tr dataset:

lrm = rowMeans(lhmat2)
lrv = apply(lhmat2, 1, var)
plot(lrm, lrv, col=6, ylab='variance', xlab='mean', main='log')
cor(lrm, lrv)
## [1] 0.1319166
text( x = 13, y = 7, paste("rho =", round( cor(lrm, lrv), 2 ) ) )

Create and study the log-transformed dataset

Now let’s log2-transform the full summarized matrix mat2, and call the result mat3.

mat3 = log2 ( mat2 + 1 )  # log transform
head( mat3 )
##                     HSM1      PTF1     RMM1     HSF1     PTM1     RMF1
## ENSG00000000003 5.942515 8.1699250 7.734710 7.370687 7.682995 8.002815
## ENSG00000000005 0.000000 0.5849625 1.000000 0.000000 0.000000 0.000000
## ENSG00000000419 4.357552 5.6724253 4.672425 5.375039 4.930737 5.108524
## ENSG00000000457 5.857981 6.0552824 5.894818 5.539159 7.108524 5.507795
## ENSG00000000460 3.087463 2.3219281 2.169925 2.000000 2.807355 1.584963
## ENSG00000000938 5.228819 5.4262648 5.357552 4.523562 6.643856 5.228819
##                      RMF2      HSM2     PTF2     RMM2     HSF2      PTM2
## ENSG00000000003 7.8641861 7.7245139 7.751544 9.398744 7.238405 8.4241663
## ENSG00000000005 0.5849625 0.5849625 1.584963 0.000000 0.000000 0.5849625
## ENSG00000000419 5.2094534 5.3219281 5.087463 5.392317 4.977280 4.9068906
## ENSG00000000457 5.1497471 5.2288187 6.851749 5.857981 5.000000 6.9128893
## ENSG00000000460 1.5849625 2.1699250 2.321928 1.321928 3.247928 2.1699250
## ENSG00000000938 3.9068906 4.4918531 5.727920 4.169925 5.066089 4.0443941
##                     RMM3     RMF3     HSM3      PTF3      PTM3     HSF3
## ENSG00000000003 8.465566 7.912889 7.495855 7.5999128 7.7481928 6.409391
## ENSG00000000005 0.000000 1.000000 0.000000 0.5849625 0.5849625 0.000000
## ENSG00000000419 5.523562 4.882643 5.000000 5.0660892 5.5698556 4.857981
## ENSG00000000457 6.366322 5.906891 5.599913 5.4757334 6.2384047 5.285402
## ENSG00000000460 1.584963 2.584963 3.247928 1.8073549 2.5849625 2.807355
## ENSG00000000938 5.209453 4.807355 5.066089 5.2854022 4.9068906 6.727920
boxplot( mat3, las=2)

And now let’s draw the tree again. Would you expect a different topology just because of log transformation?

plot( hclust( dist( t( mat3 ), method = "euclidean")  ), main="log tr")

Abracadabra! Now we start observing the phylogenetic signal. How can you explain this effect?

Here is a hypothesis: many genes in the transcriptome contain phylogenetic signal. However, some very highly expressed genes are prone to noise and do not contain phylogenetic signal. By log-transforming we downweight these genes, and thereby reveal the signal.

Check the high expression bias hypothesis (Optional exercise)

We can test this hypothesis simply by removing genes with high expression:

plot( hclust( dist( t( mat2[rowMeans(mat2) < 100, ] ), method = "euclidean")  ), main="raw, mean < 100")

Indeed, those few genes with high expression must be the culprit.

FILTERING

Many studies remove any a priori unreliable components of the dataset. This is for 2 reasons:

  1. Statistics from low expressed genes are more noisy (because of sampling error), and may not be reproducible. Filtering these out improves reproducibility.

  2. Filtering improves power by limiting the number of tests where we have low power due to low signal/noise ratio (as we’ll learn when discussing multiple testing).

We may thus remove low expressed genes as commonly done. But there is no consensus cutoff for what is low expression. We could try to come up with an educated cutoff from the expression level histograms, or keep it simple.

First check how many genes contain how many 0’s:

hist( rowSums( mat3 == 0 ), col = "gray", main =  "", xlab = "Zeros per gene") 

Here we will simply remove any gene containing 0’s. However, you could use more sophisticated approaches.

filter = ( rowSums( mat3 == 0 ) == 0 )
sum(filter)
## [1] 10644
fmat3 = mat3 [filter, ]
dim( fmat3 )
## [1] 10644    18

A complex filter function (Optional exercise)

Filter the dataset so that in each of the 3 species, >50% of the samples have >0 expression. Do this by writing a function that receives the gene’s expression levels, the cutoff, and the species vector (so that you can reuse this function on other datasets in the future).

# need to go step by step. first try the 2nd gene (contains some 0s)
y = mat3[2,]
y
##      HSM1      PTF1      RMM1      HSF1      PTM1      RMF1      RMF2 
## 0.0000000 0.5849625 1.0000000 0.0000000 0.0000000 0.0000000 0.5849625 
##      HSM2      PTF2      RMM2      HSF2      PTM2      RMM3      RMF3 
## 0.5849625 1.5849625 0.0000000 0.0000000 0.5849625 0.0000000 1.0000000 
##      HSM3      PTF3      PTM3      HSF3 
## 0.0000000 0.5849625 0.5849625 0.0000000
y_sp = y[ species2 == "HS" ]
y_sp
##      HSM1      HSF1      HSM2      HSF2      HSM3      HSF3 
## 0.0000000 0.0000000 0.5849625 0.0000000 0.0000000 0.0000000
mean( y_sp == 0 )
## [1] 0.8333333
CUTOFF = 0.5
proportionx = mean( y_sp == 0 )
# is the proportion of indv with expr 0 >= 0.5
proportionx >= CUTOFF
## [1] TRUE
# is the gene not detected in human, chimp, and macaque?  
notdet = sapply( unique(species2), function(spx) {
  y_sp = y[ species2 == spx ]
  proportionx = mean( y_sp == 0 )
  proportionx >= CUTOFF
})
notdet
## [1]  TRUE FALSE  TRUE
# is the gene detected in all species?
sum( notdet ) == 0 
## [1] FALSE

Now we can write the above as a function (just for exercise):

detfunc = function(y, CUTOFF, speciesvector) {
  notdet = sapply( unique(speciesvector), function(spx) {
    y_sp = y[ speciesvector == spx ]
    proportionx = mean( y_sp == 0 )
    proportionx >= CUTOFF
  })
  # is the gene detected in all species?
  sum( notdet ) == 0 
}
# is gene no.2 detected in all species?
detfunc(y, CUTOFF = 0.5, speciesvector = species2)
## [1] FALSE

Now we apply it to all genes, and create a boolean vector called pass that contains TRUE for the genes passing the filter:

pass = apply( mat3, 1, function(y) { 
  detfunc(y, CUTOFF = 0.5, speciesvector = species2) 
})
head(pass)
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 
##            TRUE           FALSE            TRUE            TRUE 
## ENSG00000000460 ENSG00000000938 
##            TRUE            TRUE
length(pass)
## [1] 20689
sum(pass)
## [1] 12357
mean(pass)
## [1] 0.5972739

You could now use this vector as filter, to subset the original matrix.
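
For example (the object name fmat3b is just illustrative):

# keep only the genes passing the detection filter
fmat3b = mat3[ pass, ]
dim( fmat3b )  # 12357 genes should remain, matching sum(pass) above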

NORMALIZATION

Because the same amount of RNA is supposed to have been used, the total amount of signal for all samples should be the same (assuming mRNA/total RNA ratios are the same). Any difference we find among samples in total counts probably represents technical variance. Many researchers aim to scale this out, to avoid confounding with biological effects and improve signal/noise ratios. Normalization is standard in microarrays, and also implemented when calculating RPKM, FPKM and TPM in RNA-sequencing.

The motivation for normalizing across columns (Optional exercise)

Now let’s study the distributions across columns again using boxplot. We shall continue by using the filtered matrix.

boxplot( fmat3, las=2, col=7, ylab='Expression level')

The samples present obvious differences in their distributions. Were the different samples sequenced to similar depth? Is there any confounding with species or sex? We could study this, by calculating the sum over the columns:

# total log counts 
tc = colSums(fmat3)
tc
##     HSM1     PTF1     RMM1     HSF1     PTM1     RMF1     RMF2     HSM2 
## 52349.71 59008.90 58138.31 54344.03 55329.22 57320.08 53792.36 56834.98 
##     PTF2     RMM2     HSF2     PTM2     RMM3     RMF3     HSM3     PTF3 
## 55465.19 51464.26 53839.75 55575.30 56704.36 60936.30 56051.84 55312.72 
##     PTM3     HSF3 
## 53829.89 54550.42
hist( tc, col=8, main = "total counts")

# redefine the window
par(mfrow=c(1,1))
hist( tc, col=5)

spnames = c('Human', 'Chimp', 'Macaque')
# plot against species
boxplot( tc ~ species2,
         ylab='Total lane counts', 
         col=2:4, 
         names=spnames)

# plot with interaction
boxplot( tc ~ species2 + sex2,
         ylab='Total counts', col=2:4)

Are these differences significant? We could use the Kruskal-Wallis (KW) test:

kruskal.test( tc ~ species2 )$p.val
## [1] 0.5475294
kruskal.test( tc ~ sex2 )$p.val
## [1] 0.7572777

We could also use ANOVA:

summary( aov( tc ~ species2 + sex2  ) )
##             Df   Sum Sq Mean Sq F value Pr(>F)
## species2     2  9192158 4596079   0.810  0.465
## sex2         1  3819748 3819748   0.673  0.426
## Residuals   14 79457026 5675502

Not observing a major effect is a good sign. But still, there is up to 20% difference in log counts.

As we mentioned, difference in total counts probably represents technical variance. Many authors aim to scale this out, to avoid confounding with biological effects and improve signal/noise ratios.

How to correct?

We can imagine different ways:

  • We could consider centering to the same mean by subtracting the mean of each column. But this is not realistic.

  • Alternatively, we can divide by the column sums, which is what the authors of the original paper did.

Normalizing by scaling (Optional exercise)

sfmat3 = apply( fmat3, 2, function(x) { x / sum(x) })   # divide by the column sum
# this is the same:
sfmat3 = t(t( fmat3 ) / colSums(fmat3))
dim(sfmat3)
## [1] 10644    18
head(sfmat3[,1:6])
##                         HSM1         PTF1         RMM1         HSF1
## ENSG00000000003 1.135157e-04 1.384524e-04 1.330398e-04 1.356301e-04
## ENSG00000000419 8.323927e-05 9.612830e-05 8.036742e-05 9.890764e-05
## ENSG00000000457 1.119009e-04 1.026164e-04 1.013930e-04 1.019276e-04
## ENSG00000000460 5.897765e-05 3.934878e-05 3.732350e-05 3.680257e-05
## ENSG00000000938 9.988247e-05 9.195672e-05 9.215184e-05 8.323936e-05
## ENSG00000000971 2.065834e-04 1.930974e-04 2.201209e-04 2.064416e-04
##                         PTM1         RMF1
## ENSG00000000003 1.388596e-04 1.396163e-04
## ENSG00000000419 8.911634e-05 8.912277e-05
## ENSG00000000457 1.284769e-04 9.608840e-05
## ENSG00000000460 5.073910e-05 2.765109e-05
## ENSG00000000938 1.200786e-04 9.122141e-05
## ENSG00000000971 1.866564e-04 2.220499e-04

We can further multiply by a constant to shift the values back to their original range:

sfmat3 = sfmat3 * mean(colSums(fmat3))
head(sfmat3[,1:6])
##                      HSM1      PTF1      RMM1      HSF1      PTM1
## ENSG00000000003  6.311774  7.698321  7.397366  7.541394  7.720963
## ENSG00000000419  4.628324  5.344988  4.468641  5.499527  4.955104
## ENSG00000000457  6.221987  5.705745  5.637720  5.667447  7.143654
## ENSG00000000460  3.279313  2.187896  2.075285  2.046320  2.821229
## ENSG00000000938  5.553730  5.113037  5.123886  4.628329  6.676689
## ENSG00000000971 11.486585 10.736724 12.239307 11.478698 10.378590
##                      RMF1
## ENSG00000000003  7.763033
## ENSG00000000419  4.955462
## ENSG00000000457  5.342769
## ENSG00000000460  1.537474
## ENSG00000000938  5.072152
## ENSG00000000971 12.346564
colSums(sfmat3)
##     HSM1     PTF1     RMM1     HSF1     PTM1     RMF1     RMF2     HSM2 
## 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 
##     PTF2     RMM2     HSF2     PTM2     RMM3     RMF3     HSM3     PTF3 
## 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 55602.65 
##     PTM3     HSF3 
## 55602.65 55602.65

Let us now compare the data, before and after normalization:

boxplot( fmat3, las=2, col=7, ylab='Expression level')

boxplot( sfmat3, las=2, col=7, ylab='Expression level')

plot( hclust( dist( t( fmat3 ), method = "euclidean")  ) )

plot( hclust( dist( t( sfmat3 ), method = "euclidean")  ) )

Quantile normalization

We can also use a more powerful approach: quantile normalization, an alternative method that makes the distributions exactly the same.
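
To see the logic, here is a minimal manual sketch of quantile normalization, just for intuition (below we will use the preprocessCore implementation instead):

# manual quantile normalization: replace each value by the mean of its quantile
# (row and column names are dropped; fine for a sketch)
qn_manual = function(m) {
  ranks  = apply( m, 2, rank, ties.method = "first" )  # within-column ranks
  sorted = apply( m, 2, sort )                         # sort each column
  means  = rowMeans( sorted )                          # mean of each quantile across columns
  apply( ranks, 2, function(r) means[r] )              # map ranks back to quantile means
}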

# on older R / Bioconductor versions:
source("https://bioconductor.org/biocLite.R")
biocLite("preprocessCore")
# biocLite is deprecated on current versions; use instead:
# install.packages("BiocManager"); BiocManager::install("preprocessCore")

We can use the Bioconductor package preprocessCore’s normalize.quantiles function for that. See http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/practicals/ASG1_2012/affynorm_slides/index.html#/step-1 for more.

library(preprocessCore, verbose = F, quietly = T)
nfmat3 = normalize.quantiles( fmat3 )
# we need to redefine, because the function removes the names
colnames( nfmat3 ) = colnames( fmat3 )  
rownames( nfmat3 ) = rownames( fmat3 )

boxplot( nfmat3, las=2, col=7, ylab='Expression level', main='Quantile' )

plot( hclust( dist( t( nfmat3 ), method = "euclidean")  ), main='Quantile' )

Quantile normalization could improve sensitivity but only if expression differences between samples are of limited magnitude and affect few genes. If you are comparing very different transcriptomes, e.g. different tissues, q.n. can also distort distributions, and create spurious results.

If you cannot be sure which approach (q.n. or dividing by the column sums) is superior, you can run the downstream tests using both methods, to ensure your choice of normalization does not have a large influence on the results.

In general, it makes more sense to first filter, and then scale/normalize. Otherwise the filtering will distort the distributions again.

PRINCIPAL COMPONENTS ANALYSIS

Before going into DE, one last check on the data is always helpful.

PCA is a powerful data reduction method.

PCA in R (Optional exercise)

# this summarizes variance across genes, first scaling them to the same variance
pc = prcomp(  t( nfmat3 ), scale=T)

# info on the first 6 principal components
summary( pc )$imp[, 1:6 ]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     52.50328 40.46251 34.68010 27.40859 25.50359
## Proportion of Variance  0.25898  0.15382  0.11299  0.07058  0.06111
## Cumulative Proportion   0.25898  0.41280  0.52579  0.59637  0.65748
##                             PC6
## Standard deviation     24.07841
## Proportion of Variance  0.05447
## Cumulative Proportion   0.71195

Now let’s plot:

plot( pc$x[,1:2] )

But who is who? Use text() to add the sample names on the plot:

plot( pc$x[,1:2] )
text( pc$x[,1:2], names(pc$x[,1]) )

How about PC3 and PC4?

plot( pc$x[,3:4] )
text( pc$x[,3:4], names(pc$x[,1]) )

What’s wrong with that macaque? Remove or not remove?

Here we choose not to remove it. But we might have tried both ways, to ensure that our results are not affected by the inclusion or exclusion of that individual.

Nicer plots (Optional exercise)

But before continuing, let’s try to make our PCA look nicer:

spcols = gsub("RM", 4, 
              gsub("PT", 3, 
                   gsub("HS", 2, species2 ) ) )

sexcols = gsub("F", 5, 
                gsub("M", 6, sex2 ) )

plot( pc$x[,1:2], pch=19, col = spcols, cex=1.5)
legend( -10, 10, c('Human', 'Chimp', 'Macaque'), fill = 2:4, bty = "n")

plot( pc$x[,1:2], pch=19, col = sexcols, cex=1.5)
legend( -10, 10, c('Female', 'Male'), fill = 5:6, bty = "n")

# we can also write loops to plot other PCs against one another
plot( pc$x[,3:4], pch=19, col = spcols, cex=1.5)
legend( 50, 0, c('Human', 'Chimp', 'Macaque'), fill = 2:4, bty = "n")

plot( pc$x[,3:4], pch=19, col = sexcols, cex=1.5)
legend( 50, 0, c('Female', 'Male'), fill = 5:6, bty = "n")

TESTING THE NEUTRAL MODEL

The observation that the average gene’s expression reflects the phylogeny, rather than the species’ environments, is already in line with the neutral expectation. Another characteristic one could check is whether the genes showing the highest divergence among species are also the ones showing the highest diversity within species. We can test this simply:

divergence = apply(nfmat3, 1, function(y)  {
  y1 = split(y, species2)
  var(sapply(y1, mean))
})  
diversity = apply(nfmat3, 1, function(y)  {
  y1 = split(y, species2)
  mean(sapply(y1, var))
})  
cor.test(divergence, diversity, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  divergence and diversity
## S = 1.2317e+11, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3871478
plot(divergence, diversity)
# you can also add a regression line
abline( lm(diversity ~ divergence) )

This result also conforms with what we would expect under neutrality (Khaitovich et al. 2005 Science).

TESTING DIFFERENTIAL EXPRESSION USING LINEAR MODELS

Assumptions for linear models: independent observations, normally distributed residuals, and equal variances among groups.

Now try to build a linear model of sex and species effects, and their interaction.

# choose the first gene
y = nfmat3[1,]
boxplot( y ~ species2, col=2:4)

boxplot( y ~ species2 + sex2, col=2:4)

# aov (analysis of variance) fits the model and returns a few statistics
aov( y ~ species2 )
## Call:
##    aov(formula = y ~ species2)
## 
## Terms:
##                 species2 Residuals
## Sum of Squares  2.629176  4.423199
## Deg. of Freedom        2        15
## 
## Residual standard error: 0.5430285
## Estimated effects may be unbalanced
# testing with ANOVA
summary( aov( y ~ species2 ) )
##             Df Sum Sq Mean Sq F value Pr(>F)  
## species2     2  2.629  1.3146   4.458 0.0302 *
## Residuals   15  4.423  0.2949                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary( aov( y ~ species2 + sex2 ) )
##             Df Sum Sq Mean Sq F value Pr(>F)  
## species2     2  2.629  1.3146   4.528 0.0304 *
## sex2         1  0.359  0.3590   1.237 0.2848  
## Residuals   14  4.064  0.2903                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# and to obtain the p-values for species and sex:
anova( aov( y ~ species2 + sex2 ) )$Pr[1:2]
## [1] 0.03043022 0.28484020

You can further test for interaction between species and sex:

summary( aov( y ~ species2 * sex2 ) )
##               Df Sum Sq Mean Sq F value Pr(>F)  
## species2       2  2.629  1.3146   4.258  0.040 *
## sex2           1  0.359  0.3590   1.163  0.302  
## species2:sex2  2  0.360  0.1798   0.582  0.574  
## Residuals     12  3.705  0.3087                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA is said to be robust to violations of its assumptions (normality and equal variance) when sample sizes are large and groups are balanced. Our group sizes are balanced, but sample sizes are not large. We could possibly test whether the data conform to these expectations (for each gene separately), or we could use alternative methods that have fewer assumptions.

Testing the validity of assumptions (Optional exercise)

Still, we can check whether the assumptions are satisfied or not. Let’s learn how, using the first gene:

x = split( y, species2 )
x
## $HS
##     HSM1     HSF1     HSM2     HSF2     HSM3     HSF3 
## 6.272064 7.550515 7.623875 7.458442 7.477059 6.617927 
## 
## $PT
##     PTF1     PTM1     PTF2     PTM2     PTF3     PTM3 
## 7.737762 7.872090 7.895423 8.490831 7.595058 7.851797 
## 
## $RM
##     RMM1     RMF1     RMF2     RMM2     RMM3     RMF3 
## 7.479539 7.854719 7.964594 9.248657 8.275292 7.374670
# is the data normally distributed, for human?
i = 1
shapiro.test(x[[i]])$p.val
## [1] 0.04069796
# are the variances equal, between human and chimp?
j = 2
var.test(x[[i]], x[[j]])$p.val
## [1] 0.197443
# now test this across all species + species combinations:
sapply(1:3, function(i) shapiro.test(x[[i]])$p.val )
## [1] 0.04069796 0.09013242 0.34031424
sapply(1:3, function(i) sapply(1:3, function(j) var.test(x[[i]], x[[j]])$p.val ))
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.1974430 0.7148993
## [2,] 0.1974430 1.0000000 0.1056206
## [3,] 0.7148993 0.1056206 1.0000000

We could also consider different approaches, with and without ANOVA. Here we’ll just provide a few examples:

# interaction
summary( aov( y ~ species2 * sex2 ) )
##               Df Sum Sq Mean Sq F value Pr(>F)  
## species2       2  2.629  1.3146   4.258  0.040 *
## sex2           1  0.359  0.3590   1.163  0.302  
## species2:sex2  2  0.360  0.1798   0.582  0.574  
## Residuals     12  3.705  0.3087                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# different ways to calculate the p-value for the species effect
anova( aov( y ~ sex2 ), aov( y ~ species2 + sex2 ), test = "Chisq") # comparing two models with X2
anova( aov( y ~ sex2 ), aov( y ~ species2 + sex2 ), test = "F")  # comparing 2 models with the F-test
# you could also use a non-parametric test
kruskal.test( y ~ species2 )
## 
##  Kruskal-Wallis rank sum test
## 
## data:  y by species2
## Kruskal-Wallis chi-squared = 7.4035, df = 2, p-value = 0.02468

Using a random effect model (Optional exercise)

A more sophisticated way of using ANOVA on this dataset would be to use the raw (unsummarized) data. We can then include “individual” as a random effect: a factor whose specific categories are not of interest, but whose overall effect is. For example, we could have used different individuals in this model and still estimated the same kind of “individual effect”. In contrast, species and sex are fixed effects: not every species (or sex, if others existed) would behave the same.

Note: The below is just for demonstration, as the data is not normalized / scaled. Also note that packages like DESeq2 do not use the negative binomial the way it is used below.

w = as.numeric( log2(mat[1,]+1) )  # using the original data after log-tr

# this is flawed, because it ignores technical replicates (pseudoreplication)
summary( aov( w ~ species + sex )) 
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## species      2  9.096   4.548  14.716 2.94e-05 ***
## sex          1  0.608   0.608   1.967     0.17    
## Residuals   32  9.890   0.309                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# this is more correct
summary( aov( w ~ species + sex + indv )) 
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## species      2  9.096   4.548  139.38 1.11e-11 ***
## sex          1  0.608   0.608   18.63 0.000416 ***
## indv        14  9.303   0.664   20.36 3.31e-08 ***
## Residuals   18  0.587   0.033                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# including "individual" as a random effect is the best approach:
summary( aov( w ~ species + sex + Error(indv) )) 
## 
## Error: indv
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## species    2  9.096   4.548   6.845 0.00845 **
## sex        1  0.608   0.608   0.915 0.35507   
## Residuals 14  9.303   0.664                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 18 0.5874 0.03263

Using quasipoisson and negative binomial with glm (Optional exercise)

We could also use glm, with which we can use different distributions, and use weights.

The negative binomial and the quasipoisson distributions are similar to the Poisson, but allow the variance to exceed the mean (“overdispersion”), and are thus more flexible alternatives for modeling count data.

First let’s use the quasipoisson() with the built-in glm function. We use the summarized dataset here, but not log-transformed:

z = mat2[1,]  
anova( glm( z ~ species2 + sex2, family = quasipoisson() ), glm( z ~ sex2, family = quasipoisson() ), test = "Chisq") 

We can also use the negative binomial with the glm.nb function from the MASS library:

library(MASS, verbose = F, quietly = T)
db = data.frame(z, sex2, species2)
anova( glm.nb( z ~ species2 + sex2, data = db), glm.nb( z ~  sex2, data = db)) 

Ver Hoef & Boveng (2007, Ecology) explain the difference between the two distributions:

“The variance of a quasi-Poisson model is a linear function of the mean while the variance of a negative binomial model is a quadratic function of the mean.”

So in reality, depending on the variance structure (which may be different for each gene), you may need different models. But this is not very practical, at least here. For now, we’ll stick to ANOVA.

ANOVA on the full dataset:

Let’s run the test, with interaction, on all expressed genes:

nfmat3_aovp = apply(nfmat3, 1, function(y)
  anova( aov( y ~ species2 * sex2 ) )$Pr[1:3] )

dim(nfmat3_aovp)

# apply returns each gene's results as a new column here
# therefore you should transpose to keep the genes in the rows 
nfmat3_aovp = t(nfmat3_aovp)

# add back the column names
colnames( nfmat3_aovp ) = c('sp', 'sex', 'int')

# you may save the data for later use
save( nfmat3_aovp, file = "nfmat3_aovp.Rdata")
head(nfmat3_aovp)
##                          sp        sex       int
## ENSG00000000003 0.040036605 0.30205762 0.5735498
## ENSG00000000419 0.842938544 0.93412736 0.5732420
## ENSG00000000457 0.008793507 0.02530931 0.5410162
## ENSG00000000460 0.022072219 0.36588480 0.7039392
## ENSG00000000938 0.382809048 0.82457926 0.5444900
## ENSG00000000971 0.004766938 0.15989154 0.8427885

Overview of ANOVA results

Now check that the test caught reasonable signals. For this we can study the most significant gene for the species effect by plotting the expression values in a boxplot:

i = which.min ( nfmat3_aovp[,'sp'] )
y = nfmat3 [i, ]
boxplot ( y ~ species2 * sex2, col=2:4, ylab='Expression' ) 

Makes sense. Now repeat the same for sex, and for interaction:

y = nfmat3 [which.min ( nfmat3_aovp[,'sex'] ), ]
boxplot ( y ~ species2 * sex2, col=2:4, ylab='Expression', las=2) 

y = nfmat3 [which.min ( nfmat3_aovp[,'int'] ), ]
boxplot ( y ~ species2 * sex2, col=2:4, ylab='Expression', las=2) 

And how many such nominally significant genes are there?

# histogram of the data
hist( nfmat3_aovp[,"sp"], xlab='p', col=8, main="sp" )

# proportion of cases with p<0.05
apply(nfmat3_aovp < 0.05, 2, mean)
##         sp        sex        int 
## 0.55646374 0.06031567 0.04631717

Now try to plot all 3 distributions in one go:

par( mfrow=c(1,3))
for (i in 1:3) {
  hist( nfmat3_aovp[,i], xlab='p', col=rainbow(5)[i], main=colnames(nfmat3_aovp)[i])
}

But the question is whether these p-values are really non-random and have biological meaning. Or would we expect similar distributions even if there were no real (e.g. sex) effect on the transcriptome?

Checking most and least significant results with a loop (Optional exercise):

Let’s repeat the same above using a loop:

par(mfrow=c(1,3))
# most significant genes
# boxplot() invisibly returns its statistics; wrapping the loop in invisible()
# prevents them from being printed
invisible( apply(nfmat3_aovp, 2, function(x)  {
  y = nfmat3 [which.min ( x ), ]
  boxplot ( y ~ species2 * sex2, col=2:4, ylab='Expression', las=2) 
}) )

Let’s now do this for the least significant genes:

par(mfrow=c(1,3))
# least significant genes
invisible( apply(nfmat3_aovp, 2, function(x)  {
  y = nfmat3 [which.max ( x ), ]
  boxplot ( y ~ species2 * sex2, col=2:4, ylab='Expression', las=2) 
}) )

MULTIPLE TESTING CORRECTION

Our concern has to do with the multiple testing problem.

What is the multiple testing problem?

To see the point, let’s try a t-test on random data. Do we expect to find any significant results?

# set.seed is to make sure we produce the same random-looking numbers
set.seed(1)
t.test(runif(6), runif(6))$p.val
## [1] 0.6470367
t.test(runif(6), runif(6))$p.val
## [1] 0.3367366
t.test(runif(6), runif(6))$p.val
## [1] 0.2763599

Now run this 10000 times and store the p-values:

ranp = sapply( 1:10000, function(i) {   # here i is just a counter, no role inside the loop
  t.test(runif(6), runif(6))$p.val
})
hist( ranp, xlab='p', col=8, main="random")

What kind of distribution is this? And how many nominally significant cases do we have, with p < 0.05?

qqplot(ranp, runif(1000))

mean( ranp < 0.05 )
## [1] 0.053

As you see, p-values from random data are expected to be uniformly distributed. Also be aware that here the data are fully random, so all nominally significant cases are actually false positives.

Normally in biology, we’re OK with a false positive probability (Type I error) of 0.05. But when repeating a test 1000s of times, as in our real analysis, we should expect a chunk (perhaps all) of our “significant” results to represent false positives.

Dealing with the multiple testing problem

There are two ways to address this problem:

  • We can use a correction algorithm implemented in the p.adjust function, such as Bonferroni, Benjamini-Hochberg (BH), or Benjamini-Yekutieli (BY).

  • We can use random permutations (if our sample size is large enough).

Bonferroni is the most common, and simply involves multiplying the p-values by the number of tests. This ensures that the probability of obtaining even a single false positive across all tests stays below 0.05. This is a strict approach.

BH and BY, instead, ensure that the expected proportion of false positives among the “nominally significant” results (the false discovery rate, FDR) does not exceed 0.05.
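
As a quick sanity check, we can apply these corrections to the random p-values we simulated above; essentially nothing should survive:

# counts of "significant" random p-values after correction (most likely 0)
sum( p.adjust( ranp, method = "bonferroni" ) < 0.05 )
sum( p.adjust( ranp, method = "BH" ) < 0.05 )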

Here we use the BY correction, which does not assume independence among tests.

nfmat3_aovpc = apply(nfmat3_aovp, 2, function(x) p.adjust( x, 'BY'))
dim(nfmat3_aovpc)
## [1] 10644     3
head(nfmat3_aovpc)
##                        sp sex int
## ENSG00000000003 0.7455430   1   1
## ENSG00000000419 1.0000000   1   1
## ENSG00000000457 0.2376753   1   1
## ENSG00000000460 0.4730437   1   1
## ENSG00000000938 1.0000000   1   1
## ENSG00000000971 0.1498150   1   1

How many genes remain significant at FDR < 0.05?

apply(nfmat3_aovpc < 0.05, 2, mean)
##        sp       sex       int 
## 0.2093198 0.0000000 0.0000000

Draw the histograms again:

par(mfrow=c(1,3))
for (i in 1:3) {
  hist( nfmat3_aovpc[,i], xlab='p', col=rainbow(5)[i], main=colnames(nfmat3_aovpc)[i])
}

This implies that 20% of the transcriptome shows a species effect, but the sex or interaction effects are not significant after correcting for multiple testing.

PERMUTATION TEST

Alternatively, you can use a Monte Carlo approach: randomly permute a factor, say “species”, and repeat the test on the whole dataset. By repeating this many times, you simulate your own null distribution under no species effect on the transcriptome.

You can then compare your observed results with the null, and test the hypothesis that “species” has a non-random effect on the transcriptome.

The approach may be preferable over theoretical corrections because it directly takes into account any structure in the dataset (e.g. dependence among genes).

However, if the sample size is small, not enough unique factor permutations will be available and this does not work.
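
Here sample size is not a concern. A quick way to count the distinct species-label permutations available in our design (18 samples, 6 per species):

# number of distinct ways to assign 18 samples to three groups of 6:
# 18! / (6! * 6! * 6!)
choose(18, 6) * choose(12, 6)
## [1] 17153136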

Implementing permutation (Optional exercise)

We randomly permute one factor (e.g. species), keeping everything else the same, and repeat the same ANOVA test procedure:

# let's try on the first gene
y = nfmat3[1,]
# the real result
anova( aov( y ~ species2 * sex2 ) )$Pr[1:3] 
## [1] 0.04003661 0.30205762 0.57354977
# using sample, randomize the species ID vector
randomsp = sample( species2 )
# and run the test
anova( aov( y ~ randomsp * sex2 ) )$Pr[1:3] 
## [1] 0.4922093 0.5505711 0.5968082
# randomize and test, once again
randomsp = sample( species2 )
anova( aov( y ~ randomsp * sex2 ) )$Pr[1:3] 
## [1] 0.5403829 0.3498242 0.7179912
# randomize and test, once again
randomsp = sample( species2 )
anova( aov( y ~ randomsp * sex2 ) )$Pr[1:3] 
## [1] 0.9879111 0.3766520 0.5235391

Now repeat this 100 times for the first 200 genes (to save time), and check how often you find as many genes with p<0.05 as in the real data.

perm_nfmat3_aovp = sapply(1:100, function(i) { 
  print(i) # here i is just a counter
  randomsp = sample( species2 )  
  perm_p = 
    # use just the first 200 genes
    apply(nfmat3[1:200,], 1, function(y) {  
      # collect the species effect p-value only
      anova( aov( y ~ randomsp * sex2 ) )$Pr[1]  
    })
  return(perm_p)
})
dim(perm_nfmat3_aovp)
save(perm_nfmat3_aovp, file="perm_nfmat3_aovp.RData")

Check the first 9 cases:

par(mfrow=c(3,3))
for (i in 1:9) { 
  hist( perm_nfmat3_aovp[,i], col=8, main="", xlab="") 
} 

Now check what proportion of the first 200 genes had p<0.05:

obs = mean(nfmat3_aovp[1:200,"sp"]<0.05)
obs
## [1] 0.535
exp = apply(perm_nfmat3_aovp < 0.05, 2, mean)
hist(exp, col="light blue", xlim=c(0, 0.6))
abline(v=obs, col="red")
text(0.4, 25, "no. sig.\ngenes\nobserved", col="red")

global_perm_p = mean(exp >= obs)
global_perm_p
## [1] 0

Thus we can report the global permutation test p-value as p<0.01. We can also calculate the false discovery rate using the median expectation:

perm_fdr = median(exp)/obs
perm_fdr
## [1] 0.05607477

We therefore expect c. 6% of the genes with p<0.05 to be false positives.

Let’s now repeat the same for sex:

perm_nfmat3_aovp2 = sapply(1:100, function(i) {  # here i is just a counter
  randomsex = sample( sex2 )  
  perm_p = 
    # use just the first 200 genes
    apply(nfmat3[1:200,], 1, function(y) {  
      # collect the sex effect p-value only
      anova( aov( y ~ species2 * randomsex ) )$Pr[2]  
    })
  return(perm_p)
})
dim(perm_nfmat3_aovp2)
save(perm_nfmat3_aovp2, file="perm_nfmat3_aovp2.RData")

Now check what proportion of the first 200 genes had p<0.05:

obs = mean(nfmat3_aovp[1:200,"sex"]<0.05)
obs
## [1] 0.08
exp = apply(perm_nfmat3_aovp2 < 0.05, 2, mean)
hist(exp, col="light blue", xlim=c(0, 0.6))
abline(v=obs, col="red")
text(0.4, 25, "no. sig.\ngenes\nobserved", col="red")

global_perm_p = mean(exp >= obs)
global_perm_p
## [1] 0.15

As you see, the overall result, in terms of genes with p<0.05 for the sex effect, is not significant.

CLUSTERING GENES USING K-MEANS

Now we will learn how to cluster genes based on their expression profiles (as opposed to clustering individuals). Our goal is to sort genes that show differential expression w.r.t. the species effect, group them according to their profiles, study the DE patterns, and then test for possible functional differences among these gene groups.

Here we can use the k-means algorithm, although hierarchical clustering (hclust) and other functions are also available.

An important point is that, in this and in similar work, researchers usually do not study information on absolute expression levels; they do not compare genes w.r.t. their expression levels. Rather, they study variation in expression among groups. Therefore genes with different means but the same expression pattern are not considered as “different”. For that reason, when clustering genes, it is reasonable to scale all genes to mean=0, and sd=1, so that fixed expression level differences between two genes are practically ignored (this is equivalent to clustering based on a correlation matrix).
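
We can sanity-check this equivalence with a toy example: for two profiles scaled to mean=0 and sd=1 across n samples, the squared Euclidean distance equals 2*(n-1)*(1 - correlation), so distances depend only on how correlated the profiles are:

# toy check: scaled Euclidean distance is a function of correlation only
set.seed(1)
x = rnorm(18)                 # a toy expression profile (18 samples)
y = 2 * x + rnorm(18)         # a correlated profile with a different scale
xs = as.numeric( scale(x) )   # mean=0, sd=1
ys = as.numeric( scale(y) )
sum( (xs - ys)^2 )                      # squared Euclidean distance
2 * (length(x) - 1) * (1 - cor(x, y))   # identical value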

With kmeans we need to choose the number of groups k in advance. There are methods to decide which k is best, but they do not always work well. A common approach is to try multiple k and confirm that the downstream results do not depend on the choice; one heuristic is sketched below. Here we’ll choose k=9, not for any particular reason.
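
One such heuristic is the "elbow" plot: run kmeans over a range of k and look for where the total within-cluster sum of squares stops dropping sharply. A minimal sketch, using the scaled matrix snfmat3 and the DE gene set de_sp_genes that we create below:

# elbow sketch: total within-cluster SS as a function of k
wss = sapply( 2:15, function(k)
  kmeans( snfmat3[de_sp_genes,], centers = k, nstart = 5 )$tot.withinss )
plot( 2:15, wss, type = "b", xlab = "k", ylab = "total within-cluster SS" )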

Also note that k-means is a heuristic algorithm: you can get different results on different runs, and it may sometimes fail to converge. Setting a seed (as we do below) and/or using the nstart argument to run multiple random starts makes the results more reproducible.

A last point: it is often preferable to cluster only the DE genes, to focus the rest of the study on differential expression patterns.

So start by obtaining the vector of DE genes:

head( nfmat3_aovpc[,"sp"], 20)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 
##    0.7455430479    1.0000000000    0.2376752936    0.4730437078 
## ENSG00000000938 ENSG00000000971 ENSG00000001036 ENSG00000001084 
##    1.0000000000    0.1498149761    0.4654322880    0.7003275973 
## ENSG00000001167 ENSG00000001461 ENSG00000001497 ENSG00000001561 
##    0.4294902798    0.0073549844    1.0000000000    0.0121492453 
## ENSG00000001617 ENSG00000001626 ENSG00000001629 ENSG00000001630 
##    0.0033806792    1.0000000000    0.2896711698    0.1853779904 
## ENSG00000001631 ENSG00000002330 ENSG00000002549 ENSG00000002586 
##    0.0835904766    0.0003698013    0.0068091264    0.0060550738
# the indices of the elements that are TRUE
head( which( nfmat3_aovpc[,"sp"] < 0.05 ) )
## ENSG00000001461 ENSG00000001561 ENSG00000001617 ENSG00000002330 
##              10              12              13              18 
## ENSG00000002549 ENSG00000002586 
##              19              20
# the gene names
de_sp_genes = names( which( nfmat3_aovpc[,"sp"] < 0.05 ) )
head(de_sp_genes)
## [1] "ENSG00000001461" "ENSG00000001561" "ENSG00000001617" "ENSG00000002330"
## [5] "ENSG00000002549" "ENSG00000002586"
length(de_sp_genes)
## [1] 2228

Now scale the matrix and run k-means:

# center the matrix rows to mean=0 and scale them to sd=1
snfmat3 = t( scale( t( nfmat3 ) ) )  
head(snfmat3[,1:3])
##                       HSM1        PTF1        RMM1
## ENSG00000000003 -2.2204756  0.05515234 -0.34576291
## ENSG00000000419 -1.1901890  0.62510628 -1.72943594
## ENSG00000000457  0.4622610 -0.28198057 -0.36731104
## ENSG00000000460  1.8198602 -0.37689865 -0.44927946
## ENSG00000000938  0.5319033 -0.01634660 -0.03882791
## ENSG00000000971 -0.5186128 -0.98552650  1.16246626
rowMeans(head(snfmat3))
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 
##    3.037694e-16   -1.045701e-15    4.440892e-16    1.279840e-16 
## ENSG00000000938 ENSG00000000971 
##    1.160337e-16    5.813251e-16
apply(head(snfmat3), 1, sd)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 
##               1               1               1               1 
## ENSG00000000938 ENSG00000000971 
##               1               1
set.seed(1)    # so that everyone gets the same result
km9 = kmeans( snfmat3[de_sp_genes,] , centers = 9)  # k=9 clusters on the scaled DE genes
table( km9$cluster )
## 
##   1   2   3   4   5   6   7   8   9 
## 215 387 223 137 329 282 117 137 401

Now plot the cluster means:

par( mfrow=c(3,3) )
for( i in 1:9)  {
  genes = names( which( km9$cluster == i ) )
  means = colMeans( snfmat3[ genes, ] )
  plot( means )
  text( means, colnames(snfmat3) )
}

But this looks ugly and confusing. We can create a nice color vector to distinguish species. We can also group the samples according to species.

The trick below uses the fact that a factor object can be converted to numeric:

spcols = as.numeric(species2) + 1
spcols
##  [1] 2 3 4 2 3 4 4 2 3 4 2 3 4 4 2 3 3 2

We now have to reorder, grouping all humans together, etc.:

# the indices of humans
which(species2 == 'HS')
## [1]  1  4  8 11 15 18
# the indices of humans, chimps, and macaques
order(species2)
##  [1]  1  4  8 11 15 18  2  5  9 12 16 17  3  6  7 10 13 14
# see the effect
spcols[order(species2)]
##  [1] 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4

Now draw the graphic again using the new order:

par( mfrow=c(3,3) )
for( i in 1:9)  {
  genes = names( which( km9$cluster == i ) )
  means = colMeans( snfmat3[ genes, ] )[order(species2)]
  cols = spcols[order(species2)]
  maintext = paste( 'cluster', i, 'with', length(genes), 'genes')
  plot( means, col=cols, pch=19, cex=2, main=maintext)
}

Cool - this looks better (although we still have no idea about the variation within clusters; see the sketch below).
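
One way to visualize within-cluster variation is to overlay each gene’s scaled profile behind the cluster mean; a sketch for a single cluster:

# within-cluster variation: grey lines = individual genes, red = cluster mean
i = 1
genes = names( which( km9$cluster == i ) )
ord = order( species2 )
matplot( t( snfmat3[genes, ord] ), type = "l", lty = 1, col = "grey80",
         xlab = "sample", ylab = "scaled expression", main = paste("cluster", i) )
lines( colMeans( snfmat3[genes, ord] ), col = "red", lwd = 2 )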

Which clusters look human-specific? Any guesses? What could they represent?

FUNCTIONAL ANALYSIS USING GENE ONTOLOGY (GO)

Now we will test two clusters with species-specific profiles (the human-specific cluster 8 and the macaque-specific cluster 5) for enrichment in specific GO categories, using the topGO package.

The package retrieves GO database information and runs Fisher’s exact tests (also called hypergeometric tests). Note that we use only DE genes as background, which is more appropriate (more straightforward to interpret) than using all genes in the genome or all expressed genes (which is sometimes done). We then correct for multiple testing using the BY method.
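
In essence, for each GO category this amounts to building a 2x2 table (gene in the cluster vs. in the background only, annotated with the term vs. not) and testing it with fisher.test; a toy example with invented counts:

# toy 2x2 contingency table for one hypothetical GO category
# rows: gene in cluster / in background only
# cols: gene annotated with the term / not annotated
toy = matrix( c(12, 40, 125, 1753), 2, 2 )
toy
fisher.test( toy, alternative = "greater" )$p.value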

Install the libraries to your computer:

source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
biocLite("topGO")

Load the libraries:

library(topGO, verbose = F, quietly = T)
library(org.Hs.eg.db, verbose = F, quietly = T) 
# the vector of cluster assignments, named by gene ID
geneList = km9$cluster

# collect GO data for the genes in cluster 8 (one of the human-specific clusters)
i = 8
GOdata = new( "topGOdata", 
              ontology = "BP", allGenes = geneList, 
              geneSel = function(p) { p == i }, 
              description = "Test", annot = annFUN.org, 
              mapping="org.Hs.eg.db", ID="Ensembl")
## 
## Building most specific GOs .....
##  ( 4496 GO terms found. )
## 
## Build GO DAG topology ..........
##  ( 8236 GO terms and 19063 relations. )
## 
## Annotating nodes ...............
##  ( 1930 genes annotated to the GO terms. )
# run the Fisher's exact test on each category
resultFisher = runTest( GOdata, algorithm = "classic", statistic = "fisher")
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 2677 nontrivial nodes
##       parameters: 
##           test statistic: fisher
# collect significant terms
sigterms = resultFisher@geneData["SigTerms"]
gores = GenTable(GOdata, classicFisher = resultFisher, topNodes = sigterms)

# now do multiple testing correction & collect the most extreme terms 
qval = p.adjust( gores$classicFisher, met='BY')
gores2 = cbind(gores, qval)
head( gores2 )
# repeat the same with cluster 5 (macaque-specific)
i = 5
GOdata = new( "topGOdata", 
              ontology = "BP", allGenes = geneList, 
              geneSel = function(p) { p == i }, 
              description = "Test", annot = annFUN.org, 
              mapping="org.Hs.eg.db", ID="Ensembl")
## 
## Building most specific GOs .....
##  ( 4496 GO terms found. )
## 
## Build GO DAG topology ..........
##  ( 8236 GO terms and 19063 relations. )
## 
## Annotating nodes ...............
##  ( 1930 genes annotated to the GO terms. )
resultFisher = runTest( GOdata, algorithm = "classic", statistic = "fisher")
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 3547 nontrivial nodes
##       parameters: 
##           test statistic: fisher
sigterms = resultFisher@geneData["SigTerms"]
gores = GenTable(GOdata, classicFisher = resultFisher, topNodes = sigterms)
qval = p.adjust( gores$classicFisher, met='BY')
gores2 = cbind(gores, qval)
head( gores2 )

Nothing significant. How can we interpret this result?

A precautionary note on GO analysis

There is a problem with this commonly used approach of running Fisher’s exact tests per category and then applying multiple testing correction: it treats each gene as an independent observation. But genes are co-regulated and show similar expression patterns; gene expression patterns are by default not independent. This violates the independence assumption of the test.

An alternative approach is permuting the factor of interest and running the test (as we did above), sorting genes by p-value, choosing the same number of genes as originally selected, repeating the GO enrichment analysis, and comparing how many GO categories show significant enrichment in the real case vs. the permutations. This is a more accurate approach, but takes much more time, so we will not fully implement it here; a sketch follows.
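
For reference, a compact sketch of that scheme, assuming a hypothetical helper run_go() that wraps the topGO steps above and returns the number of significant GO terms (not run here; it would be very slow):

# run_go() is a hypothetical wrapper around the topGO steps shown above
n_de = length( de_sp_genes )
perm_sig = sapply( 1:100, function(i) {
  randomsp = sample( species2 )
  p = apply( nfmat3, 1, function(y) anova( aov( y ~ randomsp * sex2 ) )$Pr[1] )
  top = names( sort(p) )[ 1:n_de ]   # same number of genes as observed
  run_go( top )                      # no. of significant GO terms in this permutation
})
mean( perm_sig >= run_go( de_sp_genes ) )   # global permutation p-value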

ENRICHMENT ANALYSIS USING BIOMART

The last exercise is testing for chromosomal enrichment.

Let’s test the putative sex-related genes for sex-chromosome enrichment.

If they are indeed non-significant / random, we expect no enrichment. Here we will obtain the chromosome location table for Ensembl genes using the biomaRt package, and run our own Fisher’s exact test.

source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

Load the library:

library(biomaRt, verbose = F, quietly = T)
listMarts()
ens = useMart( 
  biomart="ENSEMBL_MART_ENSEMBL", 
  host="grch37.ensembl.org", 
  path="/biomart/martservice",
  dataset="hsapiens_gene_ensembl")
ensChrTable = unique( getBM( attributes=c("ensembl_gene_id", "chromosome_name"), filters="", mart=ens) )
dim( ensChrTable )      # how come the table is so large?
## [1] 63677     2
head( ensChrTable )
tt = table( ensChrTable[,2])
tt[1:30]
## 
##          1         10         11         12         13         14 
##       5363       2260       3208       2818       1217       2244 
##         15         16         17         18         19          2 
##       2080       2343       2903       1127       2910       4047 
##         20         21         22          3          4          5 
##       1317        736       1263       3101       2563       2859 
##          6          7          8          9 GL000191.1 GL000192.1 
##       2905       2876       2386       2323          2          7 
## GL000193.1 GL000194.1 GL000195.1 GL000196.1 GL000199.1 GL000201.1 
##          7          3          2          1          2          2
tt[names(tt) %in% c(as.character(1:22), "X", "Y")]
## 
##    1   10   11   12   13   14   15   16   17   18   19    2   20   21   22 
## 5363 2260 3208 2818 1217 2244 2080 2343 2903 1127 2910 4047 1317  736 1263 
##    3    4    5    6    7    8    9    X    Y 
## 3101 2563 2859 2905 2876 2386 2323 2392  495

Now we need to define gene lists:

# genes on the sex chromosomes
head(ensChrTable)
# note the use of %in% operator
sex_chr_genes = ensChrTable$ensembl_gene_id [ ensChrTable$chromosome_name %in% c('X', 'Y') ]
nosex_chr_genes = ensChrTable$ensembl_gene_id [ ensChrTable$chromosome_name %in% as.character(1:22) ]

# genes DE with sex
de_sex_genes = names( which( nfmat3_aovp[,"sex"] < 0.05 ) )  # uncorrected p-values
node_sex_genes = names( which( nfmat3_aovp[,"sex"] >= 0.05 ) )
length(de_sex_genes)
## [1] 642
length(node_sex_genes)
## [1] 10002
# check that there is no overlap
intersect(de_sex_genes, node_sex_genes)
## character(0)

Now we need the sizes of the overlaps between these vectors. We can use length and intersect; to avoid repetition, we write a small function:

# function to calculate the length of the intersection between two vectors x and y
loi = function(x, y) { 
  length( intersect( x, y ) )
}

overlaps = c( 
  loi( de_sex_genes, sex_chr_genes ), 
  loi( de_sex_genes, nosex_chr_genes ), 
  loi( node_sex_genes, sex_chr_genes ), 
  loi( node_sex_genes, nosex_chr_genes ) )

matx = matrix ( overlaps, 2, 2 )
matx
##      [,1] [,2]
## [1,]   25  292
## [2,]  598 9494
fisher.test( matx, alt='greater')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matx
## p-value = 0.09538
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.9246913       Inf
## sample estimates:
## odds ratio 
##   1.359215

So there is no significant enrichment, although there is a weak trend for sex-chromosome genes to be DE with sex more often than other genes (note that the test is already one-sided).

Could it be that genes on sex chr tend to be DE in general?

# the same, this time using genes DE with species
de_sp_genes = names( which( nfmat3_aovpc[,"sp"] < 0.05 ) )   # as defined earlier
node_sp_genes = names( which( nfmat3_aovpc[,"sp"] >= 0.05 ) ) 
length(de_sp_genes)
## [1] 2228
length(node_sp_genes)
## [1] 8416
overlaps = c( 
  loi( de_sp_genes, sex_chr_genes ), 
  loi( de_sp_genes, nosex_chr_genes ), 
  loi( node_sp_genes, sex_chr_genes ), 
  loi( node_sp_genes, nosex_chr_genes ) )

matx = matrix ( overlaps, 2, 2 )
matx
##      [,1] [,2]
## [1,]   63  254
## [2,] 2122 7970
fisher.test( matx, alt='greater')
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matx
## p-value = 0.7114
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.726202      Inf
## sample estimates:
## odds ratio 
##   0.931598

No, apparently not.

CONSISTENT SEX EFFECTS?

The sex effects did not appear significant in ANOVA. Are they at least consistent between species? That is, are the observed effects in the same direction?

hs_sex = apply(nfmat3, 1, function(y) t.test( y[sex2 == "F" & species2 == "HS"], y[sex2 == "M" & species2 == "HS"] )$stat )
rm_sex = apply(nfmat3, 1, function(y) t.test( y[sex2 == "F" & species2 == "RM"], y[sex2 == "M" & species2 == "RM"] )$stat )
pt_sex = apply(nfmat3, 1, function(y) t.test( y[sex2 == "F" & species2 == "PT"], y[sex2 == "M" & species2 == "PT"] )$stat )

plot( hs_sex, rm_sex)
abline( lm( rm_sex ~ hs_sex) )

plot( rm_sex, pt_sex)
abline( lm( pt_sex ~ rm_sex) )

cor.test( hs_sex, rm_sex)
## 
##  Pearson's product-moment correlation
## 
## data:  hs_sex and rm_sex
## t = 6.3222, df = 10642, p-value = 2.683e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04222149 0.08007509
## sample estimates:
##        cor 
## 0.06117028
cor.test( hs_sex, pt_sex)
## 
##  Pearson's product-moment correlation
## 
## data:  hs_sex and pt_sex
## t = 6.4476, df = 10642, p-value = 1.185e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04343328 0.08128120
## sample estimates:
##        cor 
## 0.06237967
cor.test( rm_sex, pt_sex)
## 
##  Pearson's product-moment correlation
## 
## data:  rm_sex and pt_sex
## t = 0.31409, df = 10642, p-value = 0.7535
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01595407  0.02204129
## sample estimates:
##         cor 
## 0.003044711
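
Since "consistent direction" is really a question about signs, we can also check sign agreement directly; with no consistency we would expect roughly 50% agreement:

# proportion of genes whose sex-effect t-statistics share the same sign
mean( sign(hs_sex) == sign(rm_sex) )
mean( sign(hs_sex) == sign(pt_sex) )
mean( sign(rm_sex) == sign(pt_sex) )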

Note that the p-values here are not reliable, because genes are not really independent.

Overall, we may conclude that sex does not seem to influence the liver transcriptome in any evolutionarily conserved manner, at least in this dataset.

DESEQ2 (Optional exercise)

DESeq2 is one of the most commonly used packages for differential expression analysis of RNA-seq data. It takes raw counts (you do not need to preprocess the data) and models them with a negative binomial distribution. The variance (dispersion) of each gene is estimated by pooling information across genes of similar expression level, which can be powerful when sample sizes are small.

We can download and install the package from Bioconductor.

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
library( "DESeq2")

Now let’s test the species effect. Note that the code below fits a species-only design, for which DESeq2 reports a Wald test comparing two species (RM vs HS); a likelihood ratio test comparing a full model “species + sex” against a null model with only “sex” is sketched further below:

# define the factor data frame and the count data
conditions = data.frame( species2, sex2)

# we can use the summarized and filtered dataset
# we need to round, as DESeq2 expects integers
# note that DESeq2 has functions to summarize and filter itself
countmat = round( mat2[filter,] )

# introduce the data and the variables
dds = DESeqDataSetFromMatrix(countData = countmat,
                              colData = conditions,
                              design = ~ species2)
## converting counts to integer mode
dds
## class: DESeqDataSet 
## dim: 10644 18 
## metadata(1): version
## assays(1): counts
## rownames(10644): ENSG00000000003 ENSG00000000419 ...
##   ENSG00000221420 ENSG00000221539
## rowData names(0):
## colnames(18): HSM1 PTF1 ... PTM3 HSF3
## colData names(2): species2 sex2
dds = DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
ddsp = results(dds)
ddsp
## log2 fold change (MAP): species2 RM vs HS 
## Wald test p-value: species2 RM vs HS 
## DataFrame with 10644 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE        stat
##                  <numeric>      <numeric> <numeric>   <numeric>
## ENSG00000000003 239.818673     1.09566600 0.3509261   3.1222130
## ENSG00000000419  34.249658    -0.03983516 0.2560375  -0.1555833
## ENSG00000000457  63.414732     0.12721393 0.2987194   0.4258643
## ENSG00000000460   4.476246    -1.26874744 0.4420353  -2.8702399
## ENSG00000000938  38.998018    -0.80173150 0.4100814  -1.9550544
## ...                    ...            ...       ...         ...
## ENSG00000220820  23.677448   -0.006160719 0.4153827 -0.01483143
## ENSG00000220914   5.088625    1.013367956 0.5589802  1.81288704
## ENSG00000220949  14.412704    0.090229480 0.2687884  0.33568961
## ENSG00000221420   1.859614    1.573032635 0.6187077  2.54244886
## ENSG00000221539   1.433003   -0.042661912 0.6357153 -0.06710852
##                      pvalue        padj
##                   <numeric>   <numeric>
## ENSG00000000003 0.001794970 0.007009494
## ENSG00000000419 0.876361500 0.920589579
## ENSG00000000457 0.670206742 0.771875123
## ENSG00000000460 0.004101605 0.014071598
## ENSG00000000938 0.050576654 0.109935206
## ...                     ...         ...
## ENSG00000220820  0.98816667  0.99167691
## ENSG00000220914  0.06984924  0.14221861
## ENSG00000220949  0.73710494  0.82279124
## ENSG00000221420  0.01100787  0.03208121
## ENSG00000221539  0.94649531  0.96504014
sum(ddsp$padj < 0.05, na.rm = T)
## [1] 4006
par( mfrow=c(1,2))
hist( nfmat3_aovpc[,"sp"], xlab='q', col='red', main='Sp - aov')
hist( ddsp$padj, xlab='q', col='violet', main='Sp - DESeq2')
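
As an aside, to literally compare the full model “species + sex” against the null model with only “sex” (as mentioned above), we could use DESeq2’s likelihood ratio test; a sketch reusing the objects defined above:

# likelihood ratio test: full model species + sex vs. reduced model sex only
dds2 = DESeqDataSetFromMatrix( countData = countmat,
                               colData = conditions,
                               design = ~ species2 + sex2 )
dds2 = DESeq( dds2, test = "LRT", reduced = ~ sex2 )
ddsp2 = results( dds2 )
sum( ddsp2$padj < 0.05, na.rm = TRUE )   # no. of species-DE genes by LRT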

So which method is less conservative? (Note that with the original DESeq package, this comparison with ANOVA came out just the opposite.)

Another question remains: are the putative DE genes the same genes or not, i.e. how much do the results overlap?

To answer this, first use both q-value vectors to make lists of DE gene IDs at a q-value cutoff, e.g. q<0.10. Then compare the vectors:

QCUT = 0.10
DE1 = rownames( nfmat3_aovpc ) [ nfmat3_aovpc[,"sp"] < QCUT ]
DE2 = rownames( countmat ) [ ddsp$padj < QCUT ]
length( intersect( DE1, DE2 ) ) / length(DE1)
## [1] 0.8365319
length( intersect( DE1, DE2 ) ) / length(DE2)
## [1] 0.5060116

Apparently most of what ANOVA identifies is also identified by DESeq2. To decide which method is superior (higher sensitivity and higher specificity), one approach could be to compare the results with other primate liver expression datasets.