Mapping expression patterns between species

This document demonstrates the underlying algorithms for the manuscript Conservation of expression regulation throughout the animal kingdom. The actual scripts used contain modifications to deal with the large data sizes (~1000 dataset pairs of ~100 million gene pairs each). In addition, this example does not contain GSC weighting used to reduce the effect of uneven tissue sampling.

Download

Setup

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gplots))
suppressPackageStartupMessages(library(gtools))

myheatmap <- function(d, ...) heatmap.2(d, trace="none", col=redblue, ...)

Create species matrices

For this small example, let’s create expression values for a precursor species, which then “evolves” into species A and B by duplication and elimination of some tissues. As only 1:1 orthologs are used in this document, we don’t take gene duplication and loss into account. Thus, gene 1 in species A is the 1:1 ortholog of gene 1 in species B.

set.seed(42)

N_gene <- 100
N_tissue_precursor <- 4
N_tissue_speciesA <- 5
N_tissue_speciesB <- 6

precursor <- matrix( rnorm(N_gene * N_tissue_precursor, 10, 3), nrow = N_gene )

speciesA <- matrix( nrow = N_gene, ncol = N_tissue_speciesA)
speciesA[,1] <- precursor[,1] + rnorm(N_gene, 0, 2)
speciesA[,2] <- precursor[,1] + rnorm(N_gene, 0, 2)
speciesA[,3] <- precursor[,2] + rnorm(N_gene, 0, 2)
speciesA[,4] <- precursor[,3] + rnorm(N_gene, 0, 2)
speciesA[,5] <- rnorm(N_gene, 10, 3)

speciesB <- matrix( nrow = N_gene, ncol = N_tissue_speciesB)
speciesB[,1] <- precursor[,1] + rnorm(N_gene, 0, 2)
speciesB[,2] <- precursor[,2] + rnorm(N_gene, 0, 2)
speciesB[,3] <- precursor[,2] + rnorm(N_gene, 0, 2)
speciesB[,4] <- precursor[,3] + rnorm(N_gene, 0, 2)
speciesB[,5] <- precursor[,4] + rnorm(N_gene, 0, 2)
speciesB[,6] <- rnorm(N_gene, 10, 3)

Normalize

Expression values are normalized gene by gene. In this way, artifacts that influence the relative expression patterns of genes are avoided.

# gene-wise Z-like median normalization
normalize <- function(m) {
  m <- m - rowMeans(m)
  m / sqrt(rowSums(m**2))
}

speciesA_normalized <- data.frame(normalize(speciesA))
colnames(speciesA_normalized) <- paste0("A", 1:N_tissue_speciesA)
speciesB_normalized <- data.frame(normalize(speciesB))
colnames(speciesB_normalized) <- paste0("B", 1:N_tissue_speciesB)

Cross-species correlations between tissues

It is possible to calculate correlations between tissues, as corresponding rows of the matrix correspond to 1:1 orthologous genes:

myheatmap( cor(speciesA_normalized, speciesB_normalized), Colv=FALSE, Rowv=FALSE, dendrogram="none" )

Due to the differing complement of tissues, we cannot calculate correlations between genes!

cor( t(speciesA_normalized), t(speciesB_normalized) )

## Error in cor(t(speciesA_normalized), t(speciesB_normalized)): incompatible dimensions

Fitting

The main innovation is to map expression values between the species. That is, if the genes from species A were expressed in species B, what would their expression pattern be? Once the expression patterns have been mapped across species, it is possible to compute correlations between genes of different species.

# fit separate linear models for each destination tissue
fitSpecies <- function(df.source, df.dest) {
  lapply(df.dest, function(yt) {
    d <- cbind(yt, df.source[,(1:ncol(df.source)-1)])
    lm(yt ~ ., d)
  })
}

# apply the linear models to source expression patterns,
# creating an expression matrix in the target species
mapSpecies <- function(df.source, model) {
  sapply(model, function(m) predict(m, df.source))
}

## Naively, we could fit models like this:
# modelAB <- fitSpecies(speciesA_normalized, speciesB_normalized)
## ...but then mapping would include the data used to fit the model!
# speciesA_mapped_to_B <- mapSpecies(speciesA_normalized, modelAB)

Mapping expression values using cross-validation

We fit linear models in the context of ten-fold cross-validation to make the predictions: We don’t actually care about the fitted model, but about the prediction made with the model.

mapCV <- function(df.source, df.dest) {
  result <- matrix(nrow=N_gene, ncol=ncol(df.dest))
  colnames(result) <- paste0(colnames(df.dest), "m")
  folds <- sample(rep(1:10, each=ceiling(N_gene/10)))[1:N_gene]
  for (fold in 1:10) {
    mask <- fold == folds
    model <- fitSpecies( df.source[!mask,], df.dest[!mask,] )
    result[mask,] <- mapSpecies(df.source[mask,], model)
  }
  result
}

speciesA_mapped_to_B_cv <- mapCV(speciesA_normalized, speciesB_normalized)
speciesB_mapped_to_A_cv <- mapCV(speciesB_normalized, speciesA_normalized)

Correlations between genes across species

Thanks to the mapping procedure, we can now compute correlations of expression patterns between genes of different species:

cor_gene_AB <- cor(t(speciesA_mapped_to_B_cv), t(speciesB_normalized))
myheatmap( cor_gene_AB, symm=TRUE )

Actual and mapped expression patterns

When creating the example expression matrix for species B, tissue 5 consisted of a precursor tissue that was lost in species A, and tissue 6 was a novel tissue. Consequently, the mapping algorithm fails to predict good expression values for these two tissues.

myheatmap(as.matrix(speciesB_normalized))

myheatmap(speciesA_mapped_to_B_cv)

Finally, we can compare the original and predicted tissue expression patterns to see how well the individual tissues were predicted:

myheatmap( cor(speciesA_mapped_to_B_cv, speciesB_normalized), Colv=FALSE, Rowv=FALSE, dendrogram="none" )

Uncorrected expression distances

For the 1:1 orthologs and unrelated genes, we can now compute Pearson correlations between gene expression patterns. The correlations of 1:1 orthologs are shifted towards higher values.

df.cor_gene_AB <- melt(cor_gene_AB, varnames = c("gene1", "gene2"), value.name = "corr")
df.cor_gene_AB$one_to_one <- df.cor_gene_AB$gene1 == df.cor_gene_AB$gene2
df.cor_gene_AB$uncorrected_dist <- (rank(1-df.cor_gene_AB$corr, ties.method = "random")-1) / nrow(df.cor_gene_AB)

ggplot(df.cor_gene_AB, aes(corr)) + geom_histogram(binwidth=0.1) + facet_grid(one_to_one~., scales="free") + ggtitle("Histogram of correlations")

Nonetheless, the background distribution of correlations is quite broad. To put the correlations of 1:1 orthologs in context, we define the expression distance of two genes as their relative rank among all correlations:

In this way, background gene pairs have a uniform distribution of expression distances, while the distances of 1:1 orthologs are shifted towards lower values:

ggplot(df.cor_gene_AB, aes(uncorrected_dist)) + geom_histogram(binwidth=0.1, origin=0) + facet_grid(one_to_one~., scales="free") + ggtitle("Histogram of uncorrected expression distances")

Coexpression partners

A detailed analysis of the actual data revealed that genes with many coexpression partners in the target species have higher correlations. First, let’s quantify the distribution of correlation between expression patterns in species B, and select the top 10% as cutoff for counting coexpression partners:

coexprB <- cor( t(speciesB_normalized) )
cutoff_coexpr <- quantile(coexprB, 0.9)

df.coexprB <- data.frame(coexpr=as.numeric(coexprB[ upper.tri(coexprB) ]  ))
ggplot(df.coexprB, aes(coexpr)) + geom_histogram(binwidth=0.05, origin=-1) + geom_vline(xintercept=cutoff_coexpr, color="red") + ggtitle("Coexpression of genes within species B")

N_bins <- 5

df.coexpr <- data.frame(gene2 = 1:N_gene, n_coexpr = rowSums(coexprB > cutoff_coexpr))
ggplot(df.coexpr, aes(n_coexpr)) + geom_histogram(binwidth=1) + ggtitle("Number of coexpression partners per gene")

df.coexpr$n_coexpr_qbin <- (rank(df.coexpr$n_coexpr)-1) / nrow(df.coexpr) * N_bins + 1
df.coexpr$n_coexpr_bin <- as.ordered(as.integer(df.coexpr$n_coexpr_qbin))

df.cor_gene_AB <- merge(df.cor_gene_AB, df.coexpr, by="gene2")

ggplot(df.cor_gene_AB, aes(n_coexpr, uncorrected_dist, color=n_coexpr_bin)) + geom_point() + facet_grid(.~one_to_one) + geom_smooth(aes(color="all"), color="black", method="lm")

To ameliorate this bias, we quantify the background distribution of correlations for each bin separately. For a given gene of species B, the expression distance is then computed by interpolating between the two closest bins. (In this example, 5 are used, while in the manuscript 10 bins are used.)

conversion_functions <- df.cor_gene_AB %>% group_by(n_coexpr_bin) %>% 
  do( f = approxfun( .$corr, (rank(1-.$corr, ties.method="random")-1)/nrow(.), rule=2 ) )

conversion_functions <- unlist(conversion_functions[,2])

getBin <- function(n) {
  max((df.coexpr %>% filter(n_coexpr <= n))$n_coexpr_qbin)
}

getDist <- function(correlation, n) {
    # shift bin by 0.5 to move centers
    bin <- getBin(n) - 0.5

    # edge cases: don't interpolate
    if (bin <= 1) return(conversion_functions[[1]](correlation))
    if (bin >= N_bins) return(conversion_functions[[N_bins]](correlation))

    # normal case: interpolate between neighboring bins
    D <- 1 + floor(bin) - bin
    d1 <- conversion_functions[[floor(bin)]](correlation)
    d2 <- conversion_functions[[ceiling(bin)]](correlation)

    D * d1 + (1-D) * d2
}

df.cor_gene_AB <- df.cor_gene_AB %>% group_by(n_coexpr) %>% mutate(corrected_dist = getDist(corr, n_coexpr[1]))

ggplot(df.cor_gene_AB, aes(n_coexpr, corrected_dist, color=n_coexpr_bin)) + geom_point() + facet_grid(.~one_to_one) + geom_smooth(aes(color="all"), color="black", method="lm")

Note that the bias is not completely abolished in this non-biological example.

As a measure for the amount of conservation in the gene expression patterns, we can compute the median expression distance.

median( df.cor_gene_AB[ df.cor_gene_AB$one_to_one, ]$corrected_dist )

## [1] 0.224204