Tidy Data background

S3 Preparing the data for analysis

So here starts the real reproduction; I'm particularly interested in how they implement their \(\theta\) on real data.

Getting the data, i.e. the abundance table (AT, here called Rel), from Lovell et al.

RNA.seq           <- read.csv("./data/RNA.seq.csv",        header=T, skip=1)
microarray        <- read.csv("./data/microarray.csv",     header=T, skip=3)
names(microarray) <- sub("T", "timepoint", names(microarray))
go                <- read.csv("./data/Complex_annotation", header=T, sep="\t")
names(go)[6]      <- "Systematic.name"


# Average the two replicate columns (MM1/MM2; cpc presumably stands for
# copies per cell) to get one initial mRNA copy number per gene
tmp <- data.frame(Systematic.name=RNA.seq$Systematic.name,
                  RNA.seq=rowSums(
                    RNA.seq[,grep("MM[12].*cpc.*", names(RNA.seq))],
                    na.rm=TRUE
                    )/2
                  )

# Drop any mRNAs that have a zero count in the RNA-seq
tmp <- subset(tmp, RNA.seq > 0)

# Do an inner join of Abs and the microarray data based on the Systematic names
tmp <- merge(tmp, microarray, by="Systematic.name")

# Now use the relative abundances at each microarray timepoint to multiply
# the initial mRNA copies per cell. Remove any rows that contain NAs
multipliers   <- as.matrix(tmp[, grep("timepoint", names(tmp))])
Abs           <- data.frame(tmp$RNA.seq * multipliers)
rownames(Abs) <- tmp[,"Systematic.name"]
Abs           <- na.omit(Abs) # nice function to keep only complete cases
# Remember the convention, Abs has now the samples/timepoints as columns
Abs.t         <- as.data.frame(t(Abs))
# neither Abs nor Abs.t is tidy; an observation would be the abundance of an mRNA at a timepoint

So this data holds absolute abundances (of a kind); the relative abundance table, the actual AT, will be called Rel for Relative.

Rel   <- sweep(Abs,2,colSums(Abs, na.rm=TRUE),"/")
# see, sweep is base R and nice; with dplyr it currently takes me two
# commands, and the rownames get lost

# DivColSums <- function(x) {x/sum(x, na.rm = TRUE)}
# Rel1 <- mutate_each(Abs, funs(DivColSums))

Rel.t <- as.data.frame(t(Rel))
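Since I grumbled about the rownames: a sketch of a dplyr route that keeps them, assuming the tibble package is around for the rownames juggling (the column name "id" is just a placeholder of mine):

# a sketch only, not used below
DivColSums <- function(x) {x/sum(x, na.rm = TRUE)}
Rel2 <- Abs %>%
  tibble::rownames_to_column("id") %>%
  mutate_each(funs(DivColSums), -id) %>%
  tibble::column_to_rownames("id")
# all.equal(Rel, Rel2) should hold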

Calculating Theta and More

Here, just for a tiny subset, I compare his method and mine, i.e. test that my method produces the same results.

So first, producing the tiny AT called Relsi:

Relsi <- Rel[1:10,] # the first 10 OTUs
DivColSums <- function(x) {x/sum(x, na.rm = TRUE)}
Relsi <- mutate_each(Relsi, funs(DivColSums)) # renormalise
rownames(Relsi) <- rownames(Rel)[1:10] # mutate_each drops the rownames

Relsi.t <- as.data.frame(t(Relsi)) # species as columns, needed for his approach

Lovell's Approach

## calculate the centered logratios for the compositions, i.e. the timepoints

# They use clr, a function from the compositions package;
# annoyingly, clr makes its calculations on the rows, so the input is not tidy.
# It also produces a matrix in which the compositions are rows again,
# therefore I call the outcome .t, since the species are columns
Relsi.clr.t <- as.data.frame(clr(Relsi.t))
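# a minimal check (my own) that clr really works per row: for the first
# timepoint the clr values should equal log(composition / its geometric mean)
comp1 <- unlist(Relsi.t[1, ])
all.equal(unname(unlist(Relsi.clr.t[1, ])),
          unname(log(comp1 / exp(mean(log(comp1))))))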

## calculating the logratio variances, i.e. var(log(X/Y)), where X (and Y)
# is a species over all timepoints/samples

# here he defines his own functions
#############################################

#############################
## Lovell: Replace zeros ####
#############################
# written just for this specific df; the point is to find the smallest
# non-zero value (which he calls next.largest)

replace.zeros <- function(df){
  x <- as.matrix(df[,grep("timepoint", names(df))])
  next.largest <- (unique(sort(x)))[2] # if there is a zero in x, sorting puts
  # it first, so element [2] is the smallest non-zero value
  df[df==0.0] <- next.largest
  df
}
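# tiny demo (my own) of the next.largest logic: with zeros present,
# unique(sort(x)) puts the zero first, so element [2] is the smallest non-zero
unique(sort(c(0, 0.2, 0.05, 0.2)))     # 0.00 0.05 0.20
unique(sort(c(0, 0.2, 0.05, 0.2)))[2]  # 0.05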

################################################
## Lovell: vlr, calculate logratio variance ####
################################################
# takes an abundance table as input (a special one with timepoint columns);
# uses variation and acomp from the compositions package

vlr   <- function(df){
  df <- replace.zeros(df)
  result <- variation(acomp(t(df[,grep("timepoint", names(df))]))) 
  # acomp is a function from the compositions package, and so is variation
  colnames(result) <- rownames(df)
  rownames(result) <- rownames(df)
  result
}

##############################################

# the vlr function takes the normal AT (but with timepoint-named columns),
# so here are the logratio variances
Relsi.vlr     <- vlr(Relsi) 
# NOTE: Relsi.vlr is symmetric!! an m x m matrix


# he uses theta = var(log(X/Y)) / var(log(X))
# the numerator, the logratio variance, is in Relsi.vlr
# now he calculates the denominator, and it shows he in fact uses
# var(clr(X)) here; the term var(log(X)) is a bit misleading, because the
# geometric mean was calculated on the compositions (the timepoints) and
# only then the log was taken. It would be clearer to first divide Relsi
# by the gms and then do the rest

## calculating the var(clr(X))!

Relsi.clr.var <- apply(Relsi.clr.t, 2, var) # just a vector: the variance of
# each mRNA/species over all timepoints

## calculating Phi (he calls Theta Phi; fine)
# the outcome is of course again an m x m matrix with the phi/theta values
# for each species combination
Relsi.phi     <- sweep(Relsi.vlr, 2, Relsi.clr.var, FUN="/")
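# scalar sanity check of one entry (my own; assumes Relsi has no zeros, so
# replace.zeros() does not change the numerator): since sweep() divides
# column-wise, phi[i,j] = var(log(Xi/Xj)) / var(clr(Xj))
xi <- unlist(Relsi[1, ]); xj <- unlist(Relsi[2, ])
all.equal(Relsi.phi[1, 2], unname(var(log(xi/xj)) / Relsi.clr.var[2]))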

Lovell also calculates b, r2, and the p-value of b individually, for plotting purposes.

Here you get the real b, i.e. the slope with its sign; r2 is cor(X,Y)^2; p is a way to estimate the p-value of the slope (read the supplementary material). He is proud that the function is so fast, calculating all slopes and so on for all species/mRNA combinations, so I keep it here.

# here he defines his own functions
#############################################

##########################################################################
## Lovell: the function to calculate b and r2, and, if you want, p   ####
##########################################################################
# as you see below, it takes Relsi.clr.t, i.e. where the species/mRNAs are
# columns, which makes sense: b and r2 all relate to species over all timepoints
# OUTPUT: a list with the b, p, and r2 matrices

sma.df <- function(df){
  df.cor <- stats::cor(df, use="pairwise.complete.obs") # takes all cor of the columns, i.e observations
  df.var <- stats::cov(df, use="pairwise.complete.obs")
  df.sd  <- sqrt(diag(df.var))
  
  # Following the approach of Warton et al. Biol. Rev. (2006), 81, pp. 259-291
  # r.rf2 = cor(X+Y, X-Y)^2
  #       = (var(X) - var(Y))^2  /  ((var(X) + var(Y))^2 - 4cov(X,Y)^2)
  r.rf2   <-
    (outer(diag(df.var), diag(df.var), "-")^2 ) /
    (outer(diag(df.var), diag(df.var), "+")^2 - 4 * df.var^2 )
  
  # At this point the diagonal of r.rf2 will be 0/0 = NaN. The correlation should be 0
  diag(r.rf2) <- 0
  res.dof     <- nrow(df) - 2
  F           <- r.rf2/(1 - r.rf2) * res.dof
    
  list(b=sign(df.cor) * outer(df.sd, df.sd, "/"), # slope = sign(s_xy) s_y/s_x
       p=1 - pf(F, 1, res.dof),                   # p-value of the test that b = 1
       r2=df.cor^2)                               # the squared correlation coefficient
}

#################################################


Relsi.sma     <- sma.df(Relsi.clr.t)
# Relsi.sma is a list: b is the slope between two mRNAs over time (NOTE: the
# real slope, so it can be negative), p is a p-value of this slope (which
# should go), and r2 is cor^2
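# spot check of one slope (my own): from the outer() call above,
# b[i,j] = sign(cor[i,j]) * sd_i/sd_j, i.e. species i plays Y, species j plays X
cy <- Relsi.clr.t[, 1]; cx <- Relsi.clr.t[, 2]
all.equal(Relsi.sma$b[1, 2], sign(cor(cx, cy)) * sd(cy)/sd(cx))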

Own calculations of Theta/Phi

############### Own Functions

################################################
###### geometric mean of a composition vector ##
################################################
# INPUT: the composition vector
# OUTPUT: the geometric mean, a numeric value 

# I initially thought of simply
# gm <- function(Comp) { prod(Comp)^(1/length(Comp))}
# but it cannot handle 0s or NAs, and prod() also often underflows to zero
# for large compositions. So I use a function from Stack Overflow:
# http://stackoverflow.com/questions/2602583/geometric-mean-is-there-a-built-in
# see that page, it could even be extended; but since I will usually have
# neither NAs nor 0s (pseudocounts), this function should be fine
gm <- function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
# makes use of the nice trick exp(log(5)) = 5;
# that way you can drop all 0s (in my simple solution a single 0 makes the gm 0).
# So let's say x is a composition:
# gm = (x1*x2*...*xm)^(1/length(x))
#    = exp( log( (x1*x2*...*xm)^(1/length(x)) ) )   # log(3^4) = 4 * log(3)
#    = exp( (1/length(x)) * log(x1*x2*...*xm) )
#    = exp( (1/length(x)) * sum(log(x1), log(x2), ..., log(xm)) )
# so easy
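# two quick checks of gm (my own):
gm(c(1, 2, 4))     # 2, since (1*2*4)^(1/3) = 8^(1/3) = 2
gm(c(0, 1, 2, 4))  # zeros are dropped from the sum but still count in
                   # length(x), so this gives 8^(1/4), not 8^(1/3)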


################################################
###### norm composition by its gm             ##
################################################
# INPUT: the composition vector
# OUTPUT: the composition divided by its gm 
# this function is only used for explanation purposes

gmSelf <- function(Comp) { # for one composition
        Comp/gm(Comp)
}

################################################
###### centered log ratio of a composition    ##
################################################
# INPUT: the composition vector
# OUTPUT: the composition gm-normalised and log-transformed (clr) 

clrSelf <- function(Comp) { # for one composition
        log(Comp/gm(Comp))
}

##############################################
###### Log ratio variance of two vectors ##
################################################
# INPUT: X and Y are vectors, e.g. rel.abundances of
        # a species over all samples
# OUTPUT: the logratio variance, a numeric value

LogVar <- function(X,Y) { 
        LogVar <- var(log(abs(X/Y))) # a symmetric function, so it does not matter which is the numerator
        return(LogVar)
}
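# quick check of the symmetry claim (my own): log(X/Y) = -log(Y/X), and var()
# does not care about the sign flip
all.equal(LogVar(c(1, 2, 3), c(2, 3, 5)), LogVar(c(2, 3, 5), c(1, 2, 3))) # TRUE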


###################################################
###### LogVariances between all OTUs over samples##
###################################################
# INPUT: AT_T, so a data frame with the m OTUs as columns
# OUTPUT: an m x m matrix with all the logratio variances (symmetric!!)

LogVariances <- function(df) {
        Matrix <- sapply(seq(ncol(df)), function(i) 
                sapply(seq(ncol(df)),
                       function(j) LogVar(df[,i], df[,j])))
        rownames(Matrix) <- colnames(df)
        colnames(Matrix) <- colnames(df)
        return(Matrix)
        }

##############################################
###### Variance Ratio        ##
################################################
# INPUT: X and Y are vectors, e.g. rel.abundances of
        # a species over all samples
# OUTPUT: the ratio of their variances, var(Y)/var(X)

VarRat <- function(X,Y) { 
        VarRat <- var(Y)/var(X) 
        return(VarRat)
}


###################################################
###### beta square calculator ##
###################################################
# INPUT: AT_clr_T, so a data frame with the m OTUs as columns after the
# clr of the samples has been taken
# OUTPUT: an m x m matrix with all the beta squares; entry [i,j] is var(OTU i)/var(OTU j)

betasqcalc <- function(df) {
        Matrix <- sapply(seq(ncol(df)), function(i) 
                sapply(seq(ncol(df)),
                       function(j) VarRat(df[,i], df[,j])))
        rownames(Matrix) <- colnames(df)
        colnames(Matrix) <- colnames(df)
        return(Matrix)
        }


###################################################
###### cor calculator ##
###################################################
# INPUT: AT_clr_T, so a data frame with the m OTUs as columns after the
# clr of the samples has been taken
# OUTPUT: an m x m matrix with all the cor(X,Y)

corcalc <- function(df) {
        Matrix <- sapply(seq(ncol(df)), function(i) 
                sapply(seq(ncol(df)),
                       function(j) cor(df[,i], df[,j])))
        rownames(Matrix) <- colnames(df)
        colnames(Matrix) <- colnames(df)
        return(Matrix)
        }


############################################
# I start from Relsi (i.e. the AT)

## first get the clr of Relsi (the AT), so that each sample (composition)
# is in clr form
Relsi.clr_Self <-  mutate_each(Relsi, funs(clrSelf))
# NOTE: this is the transpose of Relsi.clr.t!!
rownames(Relsi.clr_Self) <- rownames(Relsi)
colSums(Relsi.clr_Self) # the clr of each timepoint now sums to 0
##    timepoint0  timepoint0.5    timepoint1    timepoint2    timepoint3 
## -1.595946e-16  1.360023e-15  7.494005e-16 -5.273559e-16 -8.604228e-16 
##    timepoint4    timepoint6    timepoint9   timepoint12   timepoint16 
##  1.075529e-16  7.216450e-16 -9.436896e-16 -3.108624e-15 -1.373901e-15 
##   timepoint20   timepoint24   timepoint48   timepoint72   timepoint96 
##  9.436896e-16 -1.998401e-15 -1.340074e-15 -2.359224e-15 -4.524159e-15 
##  timepoint168 
## -5.967449e-16
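# checking the NOTE above (my own check; holds since Relsi has no zeros):
all.equal(as.matrix(Relsi.clr_Self), t(as.matrix(Relsi.clr.t)),
          check.attributes = FALSE)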
### For explanation purposes only #######################
# I go the extra step of just normalising the 
## AT by dividing the compositions/samples by their gm

Relsi.gm <- mutate_each(Relsi, funs(gmSelf))
rownames(Relsi.gm) <- rownames(Relsi)
## NB: log(Relsi.gm) = Relsi.clr_Self
################################################


## calculating the logratio variances
# note this is an operation on the species, so they should be columns

Relsi.vlr_Self <- LogVariances(Relsi.t)
# is identical to Relsi.vlr
# all.equal(Relsi.vlr, Relsi.vlr_Self)


### For explanation purposes only #######################
# I was surprised that he does not calculate the logratio variances from the
# clr values. The reason: the clr contains negative numbers, because the log
# has already been taken, so you would be taking it twice.

# Things become clear if you calculate it on Relsi.gm
Relsi.gm_t <- as.data.frame(t(Relsi.gm))
Relsi.vlr_gm <- LogVariances(Relsi.gm_t)

## NB: all.equal(Relsi.vlr_gm, Relsi.vlr_Self)
## SO THAT IS ALL THE EXPLANATION: he wanted to save time and therefore
# puts the log-taking directly into the clr function. It would have been
# clearer to stick with gm and take the log later
################################################

## calculating the variances (so var(clr(X))), the denominator for theta
# here I could stick with his approach
Relsi.clr_Self.t <- as.data.frame(t(Relsi.clr_Self))
Relsi.clr.var_Self <- apply(Relsi.clr_Self.t, 2, var) 
# summarise_each(Relsi.clr_Self.t, funs(var)) would work too, but produces 
# a data.frame

## calculating Phi (he calls Theta Phi; fine)
# the outcome is of course again an m x m matrix with the phi/theta values
# for each species combination
Relsi.phi_Self <- sweep(Relsi.vlr_Self, 2, Relsi.clr.var_Self, FUN="/")

# same phi as his, all understood
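# making that claim explicit (my own check):
all.equal(Relsi.phi_Self, Relsi.phi, check.attributes = FALSE)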

Implementing the catching of inverse-proportional pairs.

So I can easily calculate phi/theta using his method, i.e. \[ \theta(\log X, \log Y) = \frac{\operatorname{var}(\log(X/Y))}{\operatorname{var}(\log X)} \]

But I realised that in his direct equation for theta he puts an absolute value around the correlation (he could do the same on the slope), and I think this allows him to also catch the inverse-proportional pairs, so I would like to implement it this way too.
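To see why the \(1 + b^2 - 2b\sqrt{r^2}\) form used below reproduces theta, take two clr-transformed species \(x\) and \(y\) with \(r = \operatorname{cor}(x, y)\) and SMA slope \(b = \operatorname{sign}(r)\, s_y/s_x\) (my own derivation of the step he skips):

\[ \theta = \frac{\operatorname{var}(y - x)}{\operatorname{var}(x)} = 1 + \frac{\operatorname{var}(y)}{\operatorname{var}(x)} - \frac{2\operatorname{cov}(x, y)}{\operatorname{var}(x)} = 1 + b^2 - 2\,b\,|r| \]

since \(\operatorname{cov}(x, y) = r\, s_x s_y\) and \(b\,|r| = r\, s_y/s_x\). For \(r < 0\) the cross term flips sign and theta gets large; taking \(|b|\) (equivalently \(|r|\)) in the cross term is exactly what lets inverse-proportional pairs score a small theta too.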

I can first use his sma.df function, kicking out the p-value that I do not need:

Theta <- 1 + (Relsi.sma$b)^2 -2*Relsi.sma$b*sqrt(Relsi.sma$r2)
# like this it is all.equal(Theta, Relsi.phi)
ThetaAbs <- 1 + (Relsi.sma$b)^2 -2*abs(Relsi.sma$b)*sqrt(Relsi.sma$r2)

I could also implement it the long way:

# as seen from his approach, matrices are good for this
## calculate beta square (clearly, this way you will not learn whether the
# slope is positive or negative)
betasquare <- betasqcalc(Relsi.clr_Self.t)
# indeed all.equal(Relsi.sma$b^2,betasquare)
correlations <- corcalc(Relsi.clr_Self.t)
# indeed all.equal(correlations^2, Relsi.sma$r2)

theta_long <- 1 + betasquare - 2*sqrt(betasquare)*correlations
# and again: all.equal(theta_long, Relsi.phi)
theta_abs <- 1 + betasquare - 2*sqrt(betasquare)*abs(correlations)
# all.equal(theta_abs, ThetaAbs)

Working with the entire data: the timing section

  • this section taught me a lot about the importance of vectorisation.
  • it is just amazing how long my sapply approaches took for this large data set.

His method

Just Theta

# # system.time() takes only a single expression so
# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.vlr     <- vlr(Rel) 
# 
# Rel.clr.var <- apply(Rel.clr.t, 2, var) 
# Rel.phi     <- sweep(Rel.vlr, 2, Rel.clr.var, FUN="/")
# proc.time() - ptm
# # system: 0.202, user 1.637

His method with sma.df (i.e. with the p-values, b, cor)

# # system.time() takes only a single expression so
# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.vlr     <- vlr(Rel) 
# 
# Rel.clr.var <- apply(Rel.clr.t, 2, var) 
# Rel.phi     <- sweep(Rel.vlr, 2, Rel.clr.var, FUN="/")
# Rel.sma     <- sma.df(Rel.clr.t)
# proc.time() - ptm
# # system: 0.635, user 9.477

His method directly via sma.df (with the option to also get ThetaAbs)

# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.sma     <- sma.df(Rel.clr.t)
# 
# Theta <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*sqrt(Rel.sma$r2)
# # like this it is all.equal(Theta, Rel.phi)
# ThetaAbs <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*sqrt(Rel.sma$r2)
# proc.time() - ptm
# 
# #user  system elapsed 
# #8.377   0.726   9.286

My methods

TIME, WTF!!!

# ptm <- proc.time()
# 
# Rel.clr_Self <-  mutate_each(Rel, funs(clrSelf))
# rownames(Rel.clr_Self) <- rownames(Rel)
# Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))
# 
# Rel.vlr_Self <- LogVariances(Rel.t)
# 
# 
# Rel.clr.var_Self <- apply(Rel.clr_Self.t, 2, var) 
# Rel.phi_Self <- sweep(Rel.vlr_Self, 2, Rel.clr.var_Self, FUN="/")
# 
# proc.time() - ptm
# #user  system elapsed 
# #673.217   4.012 684.280 
# The time was eaten by the LogVariances function. 
# HOW can vlr from the compositions package be so much faster?
# At least all.equal(Rel.phi, Rel.phi_Self) holds

After that time disaster, I almost did not dare to run the next one:

# ptm <- proc.time()
# 
# Rel.clr_Self <-  mutate_each(Rel, funs(clrSelf))
# rownames(Rel.clr_Self) <- rownames(Rel)
# Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))
# 
# betasquare <- betasqcalc(Rel.clr_Self.t)
# 
# correlations <- corcalc(Rel.clr_Self.t)
# 
# 
# theta_long <- 1 + betasquare - 2*sqrt(betasquare)*correlations
# # and again: all.equal(theta_long, Relsi.phi)
# theta_abs <- 1 + betasquare - 2*sqrt(betasquare)*abs(correlations)
# 
# proc.time() - ptm
# #user   system  elapsed 
# #2011.347   16.063 2065.052 

This one contained two sapply-based functions and, yep, it took ages.

The fast algorithm for the implementation

What did I learn from the timing section? You need to work vectorised, vectorised; sapply is obviously more like a loop. (A vectorised rewrite of my LogVariances is sketched below.)
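As a takeaway, here is a vectorised sketch of my LogVariances (my own rewrite, not from Lovell): since var(log X - log Y) = var(log X) + var(log Y) - 2 cov(log X, log Y), a single cov() call on the log-transformed table replaces both sapply loops.

# INPUT: AT_T, a data frame with the m OTUs as columns (same as LogVariances)
# OUTPUT: the same m x m logratio variance matrix, from one covariance matrix
LogVariancesFast <- function(df) {
  V <- stats::cov(log(df), use = "pairwise.complete.obs")
  outer(diag(V), diag(V), "+") - 2 * V
}
# all.equal(LogVariancesFast(Relsi.t), Relsi.vlr_Self) should hold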

So I could simply use his implementation, because the timing section also revealed that the results are exactly the same. But I do not need the p-value of the slope, and maybe I can avoid the compositions package; loading packages can take time, too. I also hoped to learn something from his sma.df function by taking the p-value calculations out.

##########################################################################
## sma.df.Short: Lovell's sma.df, adjusted and without p             ####
##########################################################################
# INPUT: AT_T or, more likely, AT_clr_T, so columns are the OTUs over all samples
# OUTPUT: a list with the b and corre matrices; b contains all the slopes
# between OTU1 and OTU2 and so on, corre contains the correlations between the
# OTUs over all samples; both are m x m matrices if m is the number of OTUs

sma.df.Short <- function(df){
  df.cor <- stats::cor(df, use="pairwise.complete.obs") # cor() works on whole
  # matrices, i.e. it is vectorised
  variances <- apply(df, 2, stats::var)  
  list(b=sign(df.cor) * sqrt(outer(variances, variances, "/")), # slope = sign(s_xy) * s_y/s_x
       corre =df.cor)                                           # the correlation
}
ptm <- proc.time()
Rel.clr.t <- as.data.frame(clr(Rel.t))
proc.time() - ptm
##    user  system elapsed 
##   0.140   0.005   0.145
Rel.sma     <- sma.df(Rel.clr.t)

Theta <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*sqrt(Rel.sma$r2)
# like this it is all.equal(Theta, Rel.phi)
ThetaAbs <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*sqrt(Rel.sma$r2)
proc.time() - ptm
##    user  system elapsed 
##   8.823   0.599   9.765
ptm <- proc.time()
Rel.clr_Self <-  mutate_each(Rel, funs(clrSelf))
rownames(Rel.clr_Self) <- rownames(Rel)
Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))


Rel.sma     <- sma.df.Short(Rel.clr_Self.t)

ThetaS <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*Rel.sma$corre
ThetaAbsS <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*abs(Rel.sma$corre)
ThetaTestS <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*Rel.sma$corre
proc.time() - ptm
##    user  system elapsed 
##   1.783   0.205   1.998