So here starts the real reproduction; I am particularly interested in how they implement their \(\theta\) on real data.
RNA.seq <- read.csv("./data/RNA.seq.csv", header=T, skip=1)
microarray <- read.csv("./data/microarray.csv", header=T, skip=3)
names(microarray) <- sub("T", "timepoint", names(microarray))
go <- read.csv("./data/Complex_annotation", header=T, sep="\t")
names(go)[6] <- "Systematic.name"
tmp <- data.frame(Systematic.name=RNA.seq$Systematic.name,
                  RNA.seq=rowSums(
                    RNA.seq[, grep("MM[12].*cpc.*", names(RNA.seq))],
                    na.rm=TRUE
                  )/2)
# Drop any mRNAs that have a zero count in the RNA-seq
tmp <- subset(tmp, RNA.seq > 0)
# Do an inner join of the abundances and the microarray data based on the Systematic names
tmp <- merge(tmp, microarray, by="Systematic.name")
# Now use the relative abundances at each microarray timepoint to multiply
# the initial mRNA copies per cell. Remove any rows that contain NAs
multipliers <- as.matrix(tmp[, grep("timepoint", names(tmp))])
Abs <- data.frame(tmp$RNA.seq * multipliers)
rownames(Abs) <- tmp[,"Systematic.name"]
Abs <- na.omit(Abs) # nice function to keep only complete cases
# Remember the convention, Abs has now the samples/timepoints as columns
Abs.t <- as.data.frame(t(Abs))
# neither Abs nor Abs.t is tidy; an observation would be the abundance of an mRNA at a timepoint, see the sketch below
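# For illustration only, a minimal sketch of the tidy long form in plain
# base R (the name Abs.tidy is mine; unlist concatenates column-wise, so the
# rownames repeat within each timepoint column):
Abs.tidy <- data.frame(Systematic.name=rep(rownames(Abs), times=ncol(Abs)),
                       timepoint=rep(names(Abs), each=nrow(Abs)),
                       abundance=unlist(Abs, use.names=FALSE))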
So this data has absolute abundances (kind of); the relative version of the abundance table (AT) will be called Rel, for Relative.
Rel <- sweep(Abs,2,colSums(Abs, na.rm=TRUE),"/")
# see, sweep is in the base package and nice; with dplyr it currently takes me
# two commands, and the rownames get lost
# DivColSums <- function(x) {x/sum(x, na.rm = TRUE)}
# Rel1 <- mutate_each(Abs, funs(DivColSums))
Rel.t <- as.data.frame(t(Rel))
Here, just for a tiny subset, I compare his and my method, i.e. I test that my method produces the same results.
So here, first, producing the tiny AT called Relsi:
Relsi <- Rel[1:10,] # 10 OTUs
DivColSums <- function(x) {x/sum(x, na.rm = TRUE)}
Relsi <- mutate_each(Relsi, funs(DivColSums)) # renormalise
rownames(Relsi) <- rownames(Rel[1:10,])
Relsi.t <- as.data.frame(t(Relsi)) # species as columns, needed for his approach
## calculate the centered logratios for the compositions, i.e. the timepoints
# They use clr, a function from the compositions package;
# annoyingly, clr makes its calculations on the rows, so it needs the samples as rows (not tidy either),
# and it also produces a matrix in which the compositions are rows again,
# therefore I call the outcome .t since, as in Relsi.t, the samples are rows
Relsi.clr.t <- as.data.frame(clr(Relsi.t))
## calculating the logratio variances, so var(log(X/Y)), where X and Y are
# species over all timepoints/samples
# here he defines his own functions
#############################################
#############################################
## Lovell: Replace zeros ####
#############################
# works just for this specific df; anyway, the point is to find the
# next-smallest non-zero value (which he calls next.largest)
replace.zeros <- function(df){
  x <- as.matrix(df[,grep("timepoint", names(df))])
  next.largest <- (unique(sort(x)))[2] # if there is a zero in x, the sorted
  # unique values start with 0, so [2] is the smallest non-zero value
  df[df==0.0] <- next.largest
  df
}
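# A toy check of the indexing trick (my own numbers):
# x <- c(0, 0.1, 0.3, 0.6)
# (unique(sort(x)))[2] # 0.1: the sorted unique values start with 0, so [2]
# # picks the smallest non-zero value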
#################################################
## Lovell: vlr, calculate logratio variance ####
#################################################
# takes an abundance table as input, (special ones with timepoint columns)
# uses variation and acomp from the compositions package
vlr <- function(df){
  df <- replace.zeros(df)
  result <- variation(acomp(t(df[,grep("timepoint", names(df))])))
  # acomp and variation are both functions from the compositions package
  colnames(result) <- rownames(df)
  rownames(result) <- rownames(df)
  result
}
##############################################
# the vlr function takes the normal AT (but with timepoint named columns)
# so here the logratio variances
Relsi.vlr <- vlr(Relsi)
# NOTE Relsi.vlr is symmetric!! an m x m matrix
# he uses theta = var(log(X/Y))/var(log(X))
# the numerator, so the logratio variance is in Relsi.vlr
# now he calculates the denominator, and it shows that he in fact uses
# var(clr(X)) here. The term var(log X) is a bit misleading, because the
# geometric mean was calculated on the compositions, i.e. the timepoints, and
# then the log was taken. So it would be clearer to first turn Relsi into its gm-normalised form and then do the rest
## calculating the var(clr(X))!
Relsi.clr.var <- apply(Relsi.clr.t, 2, var) # This is just a vector, the variance for each mRNA/species over all timepoints
## calculating Phi; he calls Theta Phi, fine.
# the outcome is of course again an m x m matrix with all the phi/theta values for each species pair
Relsi.phi <- sweep(Relsi.vlr, 2, Relsi.clr.var, FUN="/")
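Written out, these two steps compute
\[ \phi(X, Y) = \frac{\operatorname{var}(\log(X/Y))}{\operatorname{var}(\operatorname{clr}(X))} \]
with the numerator taken from Relsi.vlr and the denominator from Relsi.clr.var.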
Here you get the real b, i.e. the slope with sign. r2 is cor(X,Y)^2. p is a way to estimate the p-value of the slope; read the supplementary. He is proud that the function is so fast, calculating all slopes and so on for all species/mRNA combinations, so I keep it here.
# here he defines his own functions
#############################################
##########################################################################
## Lovell: the function to calculate b, and r2, and well p if you want####
##########################################################################
# as you see below it takes Relsi.clr.t, i.e. where the species/mRNAs are
# columns, which makes sense: b and r2 all relate to species over all timepoints
# OUTPUT: a list with the b, p, and r2 matrices
sma.df <- function(df){
  df.cor <- stats::cor(df, use="pairwise.complete.obs") # correlations between all columns, i.e. the species over all observations
  df.var <- stats::cov(df, use="pairwise.complete.obs")
  df.sd <- sqrt(diag(df.var))
  # Following the approach of Warton et al. Biol. Rev. (2006), 81, pp. 259-291
  # r.rf2 = cor(X+Y, X-Y)^2
  #       = (var(X) - var(Y))^2 / ((var(X) + var(Y))^2 - 4cov(X,Y)^2)
  r.rf2 <-
    (outer(diag(df.var), diag(df.var), "-")^2 ) /
    (outer(diag(df.var), diag(df.var), "+")^2 - 4 * df.var^2 )
  # At this point the diagonal of r.rf2 will be 0/0 = NaN. The correlation should be 0
  diag(r.rf2) <- 0
  res.dof <- nrow(df) - 2
  F <- r.rf2/(1 - r.rf2) * res.dof
  list(b=sign(df.cor) * outer(df.sd, df.sd, "/"), # slope = sign(s_xy) * s_y/s_x
       p=1 - pf(F, 1, res.dof), # p-value of the test that b = 1
       r2=df.cor^2) # the squared correlation coefficient
}
#################################################
Relsi.sma <- sma.df(Relsi.clr.t)
# Relsi.sma is a list: b is the slope between two mRNAs over time (NOTE, the real slope, so it can be negative); p is a p-value for this slope (should go out); r2 is cor^2
############### Own Functions
##############################################
###### geometric mean of a composition vector##
################################################
# INPUT the vector
# OUTPUT: the geometric mean, a numeric value
# I initially thought of simply
# gm <- function(Comp) { prod(Comp)^(1/length(Comp)) }
# but that cannot handle 0s or NAs, and prod for long compositions often underflows to zero.
# So I use a function from stackoverflow,
# here is the link: http://stackoverflow.com/questions/2602583/geometric-mean-is-there-a-built-in
# see that page, it could even be extended, but since I will usually have no NAs and no 0s (pseudocount) this function should be fine
gm <- function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
# makes use of a nice trick: exp(log(5)) = 5,
# so you can remove all 0s (in my simple solution a single 0 makes the gm 0).
# So let's say x is a composition:
# gm = (x1*x2*...*xm)^(1/length(x))
#    = exp( log( (x1*x2*...*xm)^(1/length(x)) ) )  # log(3^4) = 4 * log(3)
#    = exp( (1/length(x)) * log(x1*x2*...*xm) )
#    = exp( (1/length(x)) * sum(log(x1), log(x2), ..., log(xm)) )
# so easy
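# A quick numeric check (toy numbers of my own):
# gm(c(2, 8)) # sqrt(2*8) = 4
# gm(c(0, 2, 8)) # the 0 is dropped from the sum but still counts in
# # length(x), so this gives (2*8)^(1/3), about 2.52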
##############################################
###### norm Composition by gm##
################################################
# INPUT: the composition vector
# OUTPUT: The composition divided by its gm
# this function is only used for explanation purposes
gmSelf <- function(Comp) { # for a composition
  Comp/gm(Comp)
}
##############################################
###### centered log ratio of a Composition##
################################################
# INPUT: the composition vector
# OUTPUT: the composition divided by its gm and log-transformed (clr)
clrSelf <- function(Comp) { # for a composition
  log(Comp/gm(Comp))
}
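# A quick sanity check of my own (toy composition; assuming the
# compositions package is loaded): clrSelf should agree with clr
# comp <- c(0.2, 0.3, 0.5)
# all.equal(clrSelf(comp), as.numeric(clr(comp))) # should be TRUE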
##############################################
###### Log ratio variance of two vectors ##
################################################
# INPUT: X and Y are vectors, e.g. rel.abundances of
# a species over all samples
# OUTPUT: the logratio variance a numeric value
LogVar <- function(X,Y) {
  LogVar <- var(log(abs(X/Y))) # symmetric, so it does not matter which is the numerator
  return(LogVar)
}
###################################################
###### LogVariances between all OTUs over samples##
###################################################
# INPUT: AT_T, so a data frame with the m OTUs as columns
# OUTPUT: A m x m matrix with all the logratio variances (symmetric!!)
LogVariances <- function(df) {
  Matrix <- sapply(seq(ncol(df)), function(i)
    sapply(seq(ncol(df)),
           function(j) LogVar(df[,i], df[,j])))
  rownames(Matrix) <- colnames(df)
  colnames(Matrix) <- colnames(df)
  return(Matrix)
}
##############################################
###### Variance Ratio ##
################################################
# INPUT: X and Y are vectors, e.g. rel.abundances of
# a species over all samples
# OUTPUT: the ratio of their variances, var(Y)/var(X)
VarRat <- function(X,Y) {
  VarRat <- var(Y)/var(X)
  return(VarRat)
}
###################################################
###### beta square calculator ##
###################################################
# INPUT: AT_clr_T, so a data frame with the m OTUs as columns after the
# clr of the samples has been taken
# OUTPUT: A m x m matrix with all the beta squares (var(X)/var(Y))
betasqcalc <- function(df) {
  Matrix <- sapply(seq(ncol(df)), function(i)
    sapply(seq(ncol(df)),
           function(j) VarRat(df[,i], df[,j])))
  rownames(Matrix) <- colnames(df)
  colnames(Matrix) <- colnames(df)
  return(Matrix)
}
###################################################
###### cor calculator ##
###################################################
# INPUT: AT_clr_T, so a data frame with the m OTUs as columns after the
# clr of the samples has been taken
# OUTPUT: A m x m matrix with all the cor(X,Y)
corcalc <- function(df) {
  Matrix <- sapply(seq(ncol(df)), function(i)
    sapply(seq(ncol(df)),
           function(j) cor(df[,i], df[,j])))
  rownames(Matrix) <- colnames(df)
  colnames(Matrix) <- colnames(df)
  return(Matrix)
}
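# NOTE (my own observation): for a complete df this reproduces what a single
# vectorised call to stats::cor(df) computes, i.e.
# all.equal(corcalc(df), cor(df)) should be TRUE; the speed difference
# shows up in the timing section below.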
############################################
# I start from Relsi (i.e., the AT)
## first get the clr of Relsi (the AT), so where each sample (composition)
# is in clr form
Relsi.clr_Self <- mutate_each(Relsi, funs(clrSelf))
# NOTE this is the transpose of Relsi.clr.t!!
rownames(Relsi.clr_Self) <- rownames(Relsi)
colSums(Relsi.clr_Self) # now they sum up to 0
## timepoint0 timepoint0.5 timepoint1 timepoint2 timepoint3
## -1.595946e-16 1.360023e-15 7.494005e-16 -5.273559e-16 -8.604228e-16
## timepoint4 timepoint6 timepoint9 timepoint12 timepoint16
## 1.075529e-16 7.216450e-16 -9.436896e-16 -3.108624e-15 -1.373901e-15
## timepoint20 timepoint24 timepoint48 timepoint72 timepoint96
## 9.436896e-16 -1.998401e-15 -1.340074e-15 -2.359224e-15 -4.524159e-15
## timepoint168
## -5.967449e-16
### For explanation purposes only #######################
# I go the extra step to just normalise the
## AT by dividing the compositions/samples by their gm
Relsi.gm <- mutate_each(Relsi, funs(gmSelf))
rownames(Relsi.gm) <- rownames(Relsi)
## NB: log(Relsi.gm) = Relsi.clr_Self
################################################
## calculating the LogRatio Variances
# note this is an operation on the species, so they should be the columns
Relsi.vlr_Self <- LogVariances(Relsi.t)
# is identical to Relsi.vlr
# all.equal(Relsi.vlr, Relsi.vlr_Self)
### For explanation purposes only #######################
# I was surprised that he does not calculate the logratio variances from
# the clr. The reason is that the clr contains negative numbers, because
# the log has already been taken, so you would take it twice.
# Things become clear if you calculate it on Relsi.gm
Relsi.gm_t <- as.data.frame(t(Relsi.gm))
Relsi.vlr_gm <- LogVariances(Relsi.gm_t)
## NB: all.equal(Relsi.vlr_gm, Relsi.vlr_Self)
## SO THAT IS ALL THE EXPLANATION: he wanted to save time and therefore
# put the log-taking already into the clr function. It would have been clearer
# to stick to the gm and take the log later
################################################
## calculating the variances (so var(clr(X))), the denominator for theta
# here I could stick with his approach
Relsi.clr_Self.t <- as.data.frame(t(Relsi.clr_Self))
Relsi.clr.var_Self <- apply(Relsi.clr_Self.t, 2, var)
# summarise_each(Relsi.clr_Self.t, funs(var)) would work too but produce
# a data.frame
## calculating Phi; he calls Theta Phi, fine.
# the outcome is of course again an m x m matrix with all the phi/theta values for each species pair
Relsi.phi_Self <- sweep(Relsi.vlr_Self, 2, Relsi.clr.var_Self, FUN="/")
# same phi as his, all understood
So I can easily calculate phi/theta using his method, i.e. \[ \theta(X, Y) = \frac{\operatorname{var}(\log(X/Y))}{\operatorname{var}(\log X)} \]
But I realised that in his direct equation for theta he puts an absolute value around the correlation (he could do the same on the slope). I think this might let him also catch inversely proportional pairs, so I would like to implement it this way too.
Kicking out the p-value that I do not need
Theta <- 1 + (Relsi.sma$b)^2 -2*Relsi.sma$b*sqrt(Relsi.sma$r2)
# like this it is all.equal(Theta, Relsi.phi)
ThetaAbs <- 1 + (Relsi.sma$b)^2 -2*abs(Relsi.sma$b)*sqrt(Relsi.sma$r2)
# seen from his point of view, matrices are good for this
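To see why the short formula reproduces phi: with x = clr(X) and y = clr(Y), note that log(X/Y) = x − y (the per-sample geometric mean cancels), so
\[ \theta = \frac{\operatorname{var}(y - x)}{\operatorname{var}(x)} = 1 + \frac{\operatorname{var}(y)}{\operatorname{var}(x)} - 2\,\frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} = 1 + b^2 - 2\,b\sqrt{r^2}, \]
since b = sign(r) sd(y)/sd(x), so that b·|r| = r·sd(y)/sd(x) = cov(x, y)/var(x).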
## calculate beta square (clearly, this way you will not learn whether the slope is
# positive or negative)
betasquare <- betasqcalc(Relsi.clr_Self.t)
# indeed all.equal(Relsi.sma$b^2,betasquare)
correlations <- corcalc(Relsi.clr_Self.t)
# indeed all.equal(correlations^2, Relsi.sma$r2)
theta_long <- 1 + betasquare - 2*sqrt(betasquare)*correlations
# and again: all.equal(theta_long, Relsi.phi)
theta_abs <- 1 + betasquare - 2*sqrt(betasquare)*abs(correlations)
# all.equal(theta_abs, ThetaAbs)
# # system.time() takes only a single expression, so I use proc.time()
# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.vlr <- vlr(Rel)
#
# Rel.clr.var <- apply(Rel.clr.t, 2, var)
# Rel.phi <- sweep(Rel.vlr, 2, Rel.clr.var, FUN="/")
# proc.time() - ptm
# # system: 0.202, user 1.637
# # system.time() takes only a single expression, so I use proc.time()
# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.vlr <- vlr(Rel)
#
# Rel.clr.var <- apply(Rel.clr.t, 2, var)
# Rel.phi <- sweep(Rel.vlr, 2, Rel.clr.var, FUN="/")
# Rel.sma <- sma.df(Rel.clr.t)
# proc.time() - ptm
# # system: 0.635, user 9.477
# ptm <- proc.time()
# Rel.clr.t <- as.data.frame(clr(Rel.t))
# Rel.sma <- sma.df(Rel.clr.t)
#
# Theta <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*sqrt(Rel.sma$r2)
# # like this it is all.equal(Theta, Rel.phi)
# ThetaAbs <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*sqrt(Rel.sma$r2)
# proc.time() - ptm
#
# #user system elapsed
# #8.377 0.726 9.286
TIME, WTF!!!
# ptm <- proc.time()
#
# Rel.clr_Self <- mutate_each(Rel, funs(clrSelf))
# rownames(Rel.clr_Self) <- rownames(Rel)
# Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))
#
# Rel.vlr_Self <- LogVariances(Rel.t)
#
#
# Rel.clr.var_Self <- apply(Rel.clr_Self.t, 2, var)
# Rel.phi_Self <- sweep(Rel.vlr_Self, 2, Rel.clr.var_Self, FUN="/")
#
# proc.time() - ptm
# #user system elapsed
# #673.217 4.012 684.280
# The Time was eaten by the LogVariances function.
# HOW can vlr from the compositions package be so much faster?
# at least all.equal(Rel.phi, Rel.phi_Self)
After that time disaster I almost did not dare to run the next one:
# ptm <- proc.time()
#
# Rel.clr_Self <- mutate_each(Rel, funs(clrSelf))
# rownames(Rel.clr_Self) <- rownames(Rel)
# Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))
#
# betasquare <- betasqcalc(Rel.clr_Self.t)
#
# correlations <- corcalc(Rel.clr_Self.t)
#
#
# theta_long <- 1 + betasquare - 2*sqrt(betasquare)*correlations
# # and again: all.equal(theta_long, Rel.phi)
# theta_abs <- 1 + betasquare - 2*sqrt(betasquare)*abs(correlations)
#
# proc.time() - ptm
# #user system elapsed
# #2011.347 16.063 2065.052
This contained two nested sapply loops, and whoosh, it took ages.
What did I learn from the timing section? You need to work vectorised, vectorised! sapply is obviously more like a loop.
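One lesson applied: a sketch of a vectorised alternative to my LogVariances (the name LogVariancesFast is mine). It exploits the identity var(log(X/Y)) = var(log X) + var(log Y) − 2·cov(log X, log Y), so a single call to stats::cov on the log-transformed table replaces the two nested sapply loops.
LogVariancesFast <- function(df) { # df: OTUs as columns, samples as rows
  L <- log(as.matrix(df)) # assumes strictly positive entries (pseudocount)
  C <- stats::cov(L, use="pairwise.complete.obs")
  v <- diag(C) # the variances var(log X)
  outer(v, v, "+") - 2*C # m x m matrix of all logratio variances
}
# all.equal(LogVariancesFast(Relsi.t), LogVariances(Relsi.t)) # should be TRUE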
So I could simply use his implementation, because the timing section also revealed that the results are exactly the same. But I do not need the p-value of the slope, and maybe I can avoid the compositions package; loading packages takes time, too. I also hoped to learn something from his sma.df function by taking the p-value calculations out.
##########################################################################
## sma.df.Self, Lovell's adjusted and without p ####
##########################################################################
# INPUT: AT_T or more likely AT_clr_T, so columns are the OTUs over all samples
# OUTPUT: a list with the b and corre matrices; b contains all the slopes
# between OTU1 and OTU2 and so on, corre contains the correlations between the OTUs over all samples; both are m x m matrices if m is the number of OTUs
sma.df.Short <- function(df){
  df.cor <- stats::cor(df, use="pairwise.complete.obs") # obviously works vectorised
  variances <- apply(df, 2, stats::var)
  list(b=sign(df.cor) * sqrt(outer(variances, variances, "/")), # slope = sign(s_xy) * s_y/s_x
       corre=df.cor) # the correlation
}
ptm <- proc.time()
Rel.clr.t <- as.data.frame(clr(Rel.t))
proc.time() - ptm
## user system elapsed
## 0.140 0.005 0.145
Rel.sma <- sma.df(Rel.clr.t)
Theta <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*sqrt(Rel.sma$r2)
# like this it is all.equal(Theta, Rel.phi)
ThetaAbs <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*sqrt(Rel.sma$r2)
proc.time() - ptm
## user system elapsed
## 8.823 0.599 9.765
ptm <- proc.time()
Rel.clr_Self <- mutate_each(Rel, funs(clrSelf))
rownames(Rel.clr_Self) <- rownames(Rel)
Rel.clr_Self.t <- as.data.frame(t(Rel.clr_Self))
Rel.sma <- sma.df.Short(Rel.clr_Self.t)
ThetaS <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*Rel.sma$corre
ThetaAbsS <- 1 + (Rel.sma$b)^2 -2*abs(Rel.sma$b)*abs(Rel.sma$corre)
ThetaTestS <- 1 + (Rel.sma$b)^2 -2*Rel.sma$b*Rel.sma$corre
proc.time() - ptm
## user system elapsed
## 1.783 0.205 1.998
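# Final consistency checks of my own (these should hold, since sign(b) =
# sign(corre) elementwise, so abs(b)*corre equals b*abs(corre)):
# all.equal(ThetaS, Theta)
# all.equal(ThetaAbsS, ThetaAbs)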