Variance Decomposition Methods for Gene Expression Data

Research Question

What is the relative contribution of each variance component to the total variance in gene expression?
Is expression variability primarily driven by inter-individual differences or by left-right expression differences?

Aim:

To decompose variance components in gene expression data from left and right (L/R) organs across species in the Bgee database by partitioning total variance into components:

For each gene:

Between individuals (R): Variance among individuals in right-side expression

Between individuals (L): Variance among individuals in left-side expression

Between R and L sides: Variance due to left-right asymmetry

By modeling mean expression as a function of R_expression, L_expression, and side, we would quantify the relative contribution of each variance component and determine whether expression variability is primarily driven by inter-individual differences or by left-right expression differences.

Method 1.1: Linear Mixed Models

Overview

The Linear Mixed Model (LMM) is a statistical approach that generalises linear regression by incorporating both fixed effects, which apply to the entire population, and random effects, which account for variability at group or subject levels. Unlike ordinary linear regression, which assumes independence between observations, LMMs allow for correlation within clusters of data.

LMMs are particularly useful in cases where the data structure violates the assumptions of ordinary linear models, such as when data are collected repeatedly from the same subjects over time or when data are grouped within clusters. By modelling both fixed and random effects, LMMs provide a more accurate reflection of the underlying relationships in complex data.

Assumptions: - The dependent variable is normally distributed. - The relationship between the dependent variable and independent variables is linear. - The random effects are normally distributed. - The observations are independent, conditional on the random effects. - The model does not exhibit multicollinearity among fixed effects.

1. Phenotypic variance partitioning by transcriptomic gene expression levels and environmental variables for anthropometric traits using GTEx data

Research Question

Phenotypic variation in human is the results of genetic variation and environmental influences. Understanding the contribution of genetic and environmental components to phenotypic variation is of great interest. The variance explained by genome-wide single nucleotide polymorphisms (SNPs) typically represents a small proportion of the phenotypic variance for complex traits, which may be because the genome is only a part of the whole biological process to shape the phenotypes. In this study, we propose to partition the phenotypic variance of three anthropometric traits, using gene expression levels and environmental variables from GTEx data.

Data type (Response)

Phenotypic variance ie., BMI

Variance Components

Gene expression (RNA-seq (GTEx v8), environmental effects (age, sex, ancestry, drinking status and smoking status) and residual

Package used

CORE REML (statistical method used in genetics to estimate variance and covariance components, specifically designed to address bias when random effects in a linear mixed model (LMM) are correlated) and REML

Key findings

We found that genetic factors play a significant role in determining body mass index (BMI), with the proportion of phenotypic variance explained by gene expression levels of visceral adipose tissue being 0.68 (SE=0.06). However, we also observed that environmental factors such as age, sex, ancestry, smoking status, and drinking alcohol status have a small but significant impact (0.005, SE=0.001). Interestingly, we found a significant negative correlation between the transcriptomic and environmental effects on BMI (transcriptome–environment correlation=−0.54, SE=0.14), suggesting an antagonistic relationship. This implies that individuals with lower genetic profiles may be more susceptible to the effects of environmental factors on BMI, while those with higher genetic profiles may be less susceptible. We also show that the estimated transcriptomic variance varies across tissues, e.g., the gene expression levels of whole blood tissue and environmental variables explain a lower proportion of BMI phenotypic variance (0.16, SE=0.05 and 0.04, SE=0.004 respectively). We observed a significant positive correlation between transcriptomic and environmental effects (1.21, SE=0.23) for this tissue. In conclusion, phenotypic variance partitioning can be done using gene expression and environmental data even with a small sample size (n=838 from GTEx data), which can provide insights into how the transcriptomic and environmental effects contribute to the phenotypes of the anthropometric traits.

R Implementation

REML

reml(y, v, x, data, RE.constraints = NULL, RE.startvalues = 0.1,
     RE.lbound = 1e-10, intervals.type = c("z", "LB"),
     model.name="Variance component with REML",
     suppressWarnings = TRUE, silent = TRUE, run = TRUE, ...)

2. variancePartition: interpreting drivers of variation in complex gene expression studies

Research Question

As large-scale studies of gene expression with multiple sources of biological and technical variation become widely adopted, characterizing these drivers of variation becomes essential to understanding disease biology and regulatory genetics.

we apply variancePartition to four well-characterized gene expression studies to demonstrate how the workflow facilitates interpretation of drivers of expression variation in complex study designs with multiple dimensions of variation. We illustrate how variancePartition enables rapid interpretation of the drivers of expression variation in these complex datasets.

Data type

RNA-seq data of multiple tissues tissues from the GTEx Consortium

Variance Components

Disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, and technical variables

Package used

variancePartition

Key findings

variancePartition quantifies variation in each expression trait attributable to differences in disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. Analysis of four large-scale transcriptome profiling datasets illustrates that variancePartition recovers striking patterns of biological and technical variation that are reproducible across multiple datasets.

R Implementation

# load library
library("variancePartition")

# load simulated data:
# geneExpr: matrix of gene expression values
# info: information/metadata about each sample
data(varPartData)

# Specify variables to consider
# Age is continuous so model it as a fixed effect
# Individual and Tissue are both categorical,
# so model them as random effects
# Note the syntax used to specify random effects
form <- ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)

# Fit model and extract results
# 1) fit linear mixed model on gene expression
# If categorical variables are specified,
#     a linear mixed model is used
# If all variables are modeled as fixed effects,
#       a linear model is used
# each entry in results is a regression model fit on a single gene
# 2) extract variance fractions from each model fit
# for each gene, returns fraction of variation attributable
#       to each variable
# Interpretation: the variance explained by each variables
# after correcting for all other variables
# Note that geneExpr can either be a matrix,
# and EList output by voom() in the limma package,
# or an ExpressionSet
varPart <- fitExtractVarPartModel(geneExpr, form, info)

# sort variables (i.e. columns) by median fraction
#       of variance explained
vp <- sortCols(varPart)

# Figure 1a
# Bar plot of variance fractions for the first 10 genes
plotPercentBars(vp[1:10, ])

https://www.bioconductor.org/packages/release/bioc/vignettes/variancePartition/inst/doc/variancePartition.html

3. Modeling methylation data as an additional genetic variance component

Research Question

Methylation sites provide a bridge for the investigation of real-time environmental contributions on genomic events by the alteration of methylation status of those sites. Using the data provided by GAW20’s organization committee, we calculated the heritability estimates of each cytosine-phosphate-guanine (CpG) island before and after the use of fenofibrate, a lipid-control drug.

Data type

Methylation data

Variance Components

Phenotypic variance, residual additive effect of polygenes, the gene-specific methylation kernel effect, and a random environmental effect

Sex and age as covariates

Package used

Maximum likelihood estimates (assuming a multivariate normal probability density) and likelihood ratio test by SOLAR

Key findings

We identified an interesting association on the gene TMEM5 (p<5.9×10−5). This gene encodes a highly conserved transmembrane protein almost exclusively expressed in the pancreatic tissue, but until our study, had not been associated with HDL blood concentration [14]. The status of methylation sites near TMEM5 could explain almost 4% of the observed phenotypic variability of HDL.

R Implementation

All algorithms used in this scientific manuscript are available upon request.

4. Analysis of variance components in gene expression data

Research Question

This study presents a variance-components approach to investigating animal-to-animal, between-array, within-array and day-to-day variations for two data sets.

Data type

Microarray

Variance Components

Within-array variation Between-array variation

Package used

REML

Key findings

In the first data set, we found that the between-array variance is greater than the between-section variance, which, in turn, is greater than the within-section variance. In the second data set, for the reference samples, the week-to-week variance is larger than the between-array variance, which, in turn, is slightly larger than the within-array variance. For the test samples, the week-to- week variance has the largest variation. The animal-to-animal variance is slightly larger than the between-array and within- array variances. However, in a gene-by-gene analysis, the animal-to-animal variance is smaller than the between-array variance in four out of five housekeeping genes. In summary, the largest variation observed is the week-to-week effect. Another important source of variability is the animal-to-animal variation.

Finally, we describe the use of variance-component estimates to determine optimal numbers of animals, arrays per animal and sections per array in planning microarray experiments.

R Implementation

5. Gene expression signature predicts rate of type 1 diabetes progression

Research Question

Type 1 diabetes is a complex heterogenous autoimmune disease without therapeutic interventions available to prevent or reverse the disease. This study aimed to identify transcriptional changes associated with the disease progression in patients with recent-onset type 1 diabetes.

Data type

RNA-seq data (separately for each gene, with the gene expression level as the dependent variable)

Variance Components

sex, and age were treated as fixed effects and individual, sequencing pool and study site as random effects

Package used

lmerTest R package

Key findings

We found that genes and pathways related to innate immunity were downregulated during the first year after diagnosis. Significant associations of the gene expression changes were found with ZnT8A autoantibody positivity. Rate of change in the expression of 16 genes between baseline and 12 months was found to predict the decline in C-peptide at 24 months. Interestingly and consistent with earlier reports, increased B cell levels and decreased neutrophil levels were associated with the rapid progression.

R Implementation

## load lmerTest package
library(lmerTest)

## Fit linear mixed model to the ham data:
fm <- lmer(Informed.liking ~ Gender + Information * Product + (1 | Consumer) +
             (1 | Consumer:Product), data=ham)

## Summary including coefficient table with p-values for t-statistics using
## Satterthwaite's method for denominator degrees of freedom:
summary(fm)

## Type III anova table with p-values for F-tests based on Satterthwaite's
## method:
(aov <- anova(fm))

## Inspect the contrast matrix for the Type III test of Product:
show_tests(aov, fractions = TRUE)$Product

## Choose type II anova table with Kenward-Roger method for the F-test:
# }
# NOT RUN {
if(requireNamespace("pbkrtest", quietly = TRUE))
  anova(fm, type=2, ddf="Kenward-Roger")
# }
# NOT RUN {
## Anova-like table of random-effect terms using likelihood ratio tests:
ranova(fm)

## F-tests of 'single term deletions' for all marginal terms:
drop1(fm)

## Least-Square means and pairwise differences:
(lsm <- ls_means(fm))
ls_means(fm, which = "Product", pairwise = TRUE)

## ls_means also have plot and as.data.frame methods:
# }
# NOT RUN {
plot(lsm, which=c("Product", "Information"))
as.data.frame(lsm)
## Inspect the LS-means contrasts:
show_tests(lsm, fractions=TRUE)$Product
# }
# NOT RUN {
## Contrast test (contest) using a custom contrast:
## Here we make the 2-df joint test of the main effects of Gender and Information
(L <- diag(length(fixef(fm)))[2:3, ])
contest(fm, L = L)

## backward elimination of non-significant effects:
step_result <- step(fm)

## Elimination tables for random- and fixed-effect terms:
step_result

# Extract the model that step found:
final_model <- get_model(step_result)

# }

6. A mixed-effects model of the dynamic response of muscle gene transcript expression to endurance exercise

Research Question

Altered expression of a broad range of gene transcripts after exercise reflects the specific adjustment of skeletal muscle makeup to endurance training. Towards a quantitative understanding of this molecular regulation, we aimed to build a mixed-effects model of the dynamics of co-related transcript responses to exercise. It was built on the assumption that transcript levels after exercise varied because of changes in the balance between transcript synthesis and degradation.

Data type

Microarray

Variance Components

mean characteristics (differences in the response magnitude) of the studied group of gene transcripts and between-gene variance within the group

Package used

Key findings

Predicted variation in transcription rate induced by exercise yielded a difference in amplitude and time-to-peak response of gene transcripts between the two groups before training and with training in Group 2. The findings illustrate that a mixed-effects model of transcript responses to exercise is suitable to explore the regulation of muscle plasticity by training at the transcriptional level and indicate critical experiments needed to consolidate model parameters empirically.

R Implementation

Method 1.2: Linear Mixed Models (LMM with precision weights)

Overview

The fundamental premise of weighted linear regression lies in recognizing that not all data points contribute equally to the estimation process. In many practical scenarios, observations may differ in their reliability, precision, or representativeness. For instance, in meta-analysis, studies with larger sample sizes should carry more weight than smaller studies.

A Weighted Linear Mixed Model (WLMM) combines the power of standard Linear Mixed Models (LMMs) for complex, hierarchical data with weights, assigning different importance to data points, observations, or groups, often to handle outliers, non-constant variances (heteroscedasticity), or complex sampling designs, making inferences more robust and accurate.

1. Single-Cell Transcriptomics of Multi-Site Cell Therapy in Osteoarthritis: Tissue-Specific Traits and Treatment Correlations

Research Question

dreamlet was used to perform variance partition analysis of the influence of collection site on gene expression within cell types

Data type

ScRNA-seq

Variance Components

Collection site Therapeutic response status

Package used

dreamlet (dreamlet uses linear mixed models in dream to perform differential expression in single cell data)

Key findings

Substantial influence of collection site on BMAC gene expression variability, with a median between 10% and 25% in most BMAC cell types, for some genes over 50% of the variability due to site, most notably in MSCs and plasma B cells.

R Implementation

All data produced in the present study will be available online once published in a peer-reviewed journal.

908 lines of code for dreamlet: https://github.com/GabrielHoffman/dreamlet/blob/devel/R/dreamlet.R

Method 1.3: Univariate linear model (simple linear regression)

Overview

A statistical model used to analyze the relationship between a single dependent variable (outcome) and one or more independent variables (predictors). It specifically predicts a continuous outcome (y) based on the input (x) by fitting a straight line.

Assumptions: - Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.

Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
Homoscedasticity: The residuals have constant variance at every level of x.
Normality: The residuals of the model are normally distributed.

1. A simple, scalable approach to building a cross-platform transcriptome atlas

Research Question

Gene expression atlases have transformed our understanding of the development, composition and function of human tissues. New technologies promise improved cellular or molecular resolution, and have led to the identification of new cell types, or better defined cell states. But as new technologies emerge, information derived on old platforms becomes obsolete. We demonstrate that it is possible to combine a large number of different profiling experiments summarised from dozens of laboratories and representing hundreds of donors, to create an integrated molecular map of human tissue.

Data type

ScRNA-seq

Variance Components

Platform effect

Package used

variancePartition

Key findings

The resulting atlas provides a multi-scaled approach to visualise and analyse the relationships between sets of genes and blood cell lineages, including the maturation and activation of leukocytes in vivo and in vitro.

R Implementation

Authors did not provide the code they used. See variancePartition implementation above

Method 1.4: linear mixed models (LMMs) using genome-based restricted maximum likelihood (GREML)

Overview

Linear mixed models (LMMs) using genome-based restricted maximum likelihood (GREML) allow both fixed and random effects. Classic LMMs assume independence between random effects, which can be violated. To relax the assumption, we introduced a generalised GREML, named CORE GREML, that explicitly estimates the covariance between random effects.

Assumptions: - SNP genotypes are standardized.

All classes of SNPs (common, rare, different LD levels) have the same average effect on the trait.
Independence between random effects.
Corrects for the bias towards zero in variance components that occurs in ordinary maximum likelihood estimation (MLE).

1. CORE GREML for estimating covariance between random effects in linear mixed models for complex trait analyses

Research Question

Following classic LMMs3,4, GREML assumes independence between random effects when estimating variance components. However, it is questionable if this assumption is always valid, especially for genomic analyses of complex traits. For example, gene regulatory networks shared between functional categories may generate non-negligible correlations between effects of these categories on phenotypes12. In the context of phenotypic variance partitioning by multi-omics layers, effects of genetic variants and their expression levels on phenotypes are likely correlated13,14,15, as exemplified by overlaps between GWAS loci and expression quantitative trait loci16,17 (eQTL). Given these justifiable covariance terms in genomic analyses of complex traits, the naive assumption of independence between random effects held by GREML can lead to a biased partition of phenotypic variance and false inferences on the underlying architecture of complex traits.

Data type

Genomic partitioning SNPs from coding regions, untranslated regions, and promotors (collectively referred to as “regulatory regions”)

Genome-transcriptome partitioning of phenotypic variance genetic variance estimation (constructed using genotypes of 1,133,273 genome-wide SNPs) and estimation of phenotypic variance explained by the transcriptome (was based on imputed expression levels of 227,664 genes from 43 tissues)

Variance Components

We fitted two models using GREML, a “G model” that breaks phenotypic effects into the random effects of the genome and residuals, i.e., y=g+ε, and a “G-T model” that decomposes phenotypic effects into the random effects of the genome and the imputed transcriptome and residuals, i.e., y=g+t+ε.

For each trait, Standing height, sitting height, body mass index (BMI), heel bone mineral density, fluid intelligence, weight, waist circumference, hip circumference, diastolic blood pressure, and years of education estimates, two separate sets of variance partitioning analyses, which are genomic partitioning by functional region and genome-transcriptome partitioning of phenotypic variance were conducted.

Package used

CORE GREML using MTG2

Key findings

Applying CORE GREML to UK Biobank data, we find, for example, that the transcriptome, imputed using genotype data, explains a significant proportion of phenotypic variance for height (0.15,p-value=1.5e-283), and that these transcriptomic effects correlate with the genomic effects (genome-transcriptome correlation=0.35,p-value=1.2e-14). We conclude that the covariance between random effects is a key parameter for estimation, especially when partitioning phenotypic variance by multi-omics layers.

R Implementation

Compute kernel matrix K

# covariate file
cov=read.table("toy.cov", header=F)
head(cov)
cov2=cov[,-1] # exclude 1st col, id
ncov=dim(cov2)[2] # no. of covariates

# standardization
cov3=data.frame(matrix(nrow=dim(cov2)[1], ncol=dim(cov2)[2]))
for (i in 1:ncov){
   sel=cov2[,i]
   ave=mean(sel, na.rm=T)
   std=sd(sel, na.rm=T)
   # apply standization
   cov3[,i]=(sel-ave)/(std*sqrt(ncov))
  }

# add in id
cov4=cbind(cov$V1, cov$V1, cov3)
# export dat
write.table(cov4, "toy.stdcov", col.names=F, row.names=F,quote=F)

Compute kernel matrix Q

ensure positive definite kernel matrices

# for A
./mtg2.15 -p toy.fam -g toy.grm -thread 80 -bend 2
output: toy.grm.bend
# for K
./mtg2.15 -p toy.fam -g toy.bmat -thread 80 -bend 2
output: toy.bmat.bend

cholesky decomposition

# for A
./mtg2.15 -p toy.fam -g toy.grm.bend -thread 80 -chol 1
ouput: toy.grm.bend.chol
# for K
./mtg2.15 -p toy.fam -g toy.bmat.bend -thread 80 -chol 1
output: toy.bmat.bend.chol

compute Q

/mtg2.15 -p toy.fam -mg toy_grm_bmat.chol -thread 80 -matmat 2

#inside toy_grm_bmat.chol:
#toy.grm.bend.chol
#toy.bmat.bend.chol
#output: toy_grm_bmat.chol.matmat2

Parameter estimation

#fit GREML:
./mtg2.15 -p toy.fam -mg toy_greml.matlist -d toy.dat -mod 1 -thread 100 -out toy_greml.out > toy_greml.log

#inside toy_greml.matlist:
#toy.grm
#toy.bmat
#inside toy_greml.out : see figure below

#fit CORE GREML:
./mtg2.15 -p toy.fam -mg toy_coregreml.matlist -d toy.dat -mod 1 -thread 100 -out toy_coregreml.out > toy_coregreml.log

#inside toy_coregreml.matlist:
#toy.grm
#toy.bmat
#toy_grm_bmat.chol.matmat2
#inside toy_coregreml.out : see figure below

Method 1.5: Comparison of linear model (GLS) of fixed effects/ linear model with fixed effects and correlated error, multivariate (OLS) linear model ✔️ and generalized estimating equation (GEE) models

Overview

A multivariate linear model is a statistical tool that predicts a single dependent variable (outcome) using two or more independent variables (predictors) in a linear equation, extending simple linear regression to handle more complex real-world scenarios where multiple factors influence an outcome.

Generalized estimation equations (GEE) are a semi-parametric alternative to Generalized Linear Mixed Effect Models (GLMM). They are semi-parametric because the parameter estimates are estimated parametrically, but the variances are not parametrically.

Usually, part of the process for defining the variance structure consists in defining the correlation structure of the data points within individual participants; for example, data collected sequentially over time could be modeled differently than data clustered spatially, e.g. over a number of different isolated areas. Correlation structures may include, among other things, an independent correlation structure that assumes that:

There is no correlation between data points
A composite symmetrical or interchangeable correlation structure that assumes that data from an individual participant is correlated within that participant, but all data points within the participants are equally correlated
An autoregressive correlation structure that assumes that data points within participants that are closer to each other in time are more correlated than data points that are further away.

1. Monte Carlo simulation of OLS and linear mixed model inference of phenotypic effects on gene expression

Research Question

Many self-contained gene set analysis methods have been developed but the performance of these methods for phenotypes that are continuous rather than discrete and with multiple nuisance covariates has not been well studied. Here, I use Monte Carlo simulation to evaluate the performance of both novel and previously published (and readily available via R) methods for inferring effects of a continuous predictor on mean expression in the presence of nuisance covariates. The motivating data are a high-profile dataset which was used to show opposing effects of hedonic and eudaimonic well-being (or happiness) on the mean expression level of a set of genes that has been correlated with social adversity (the CTRA gene set). The original analysis of these data used a linear model (GLS) of fixed effects with correlated error to infer effects of Hedonia and Eudaimonia on mean CTRA expression.

Data type

Monte Carlo simulation of data explicitly modeled on the re-analyzed dataset

Variance Components

Regression coefficients

Package used

Key findings

GLS estimates are inconsistent between data sets, and, in each dataset, at least one coefficient is large and highly statistically significant. By contrast, effects estimated by OLS or GEE are very small, especially relative to the standard errors. Bootstrap and permutation GLS distributions suggest that the GLS results in downward biased standard errors and inflated coefficients. The Monte Carlo simulation of error rates shows highly inflated Type I error from the GLS test and slightly inflated Type I error from the GEE test. By contrast, Type I error for all OLS tests are at the nominal level. The permutation F-tests have ∼1.9X the power of the other OLS tests. This increased power comes at a cost of high sign error (∼10%) if tested on small effects.

R Implementation

837 lines of code https://github.com/middleprofessor/happiness/blob/master/FredricksonReanalysis.R

Method 2: Multivariate GMM with variance decomposition

Overview

The Gaussian mixture model (GMM) is a parametric probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions. These distributions completely characterize the model, therefore it is composed of a weighted sum of Gaussian components. In particular, three parameters for each Gaussian component are sufficient to represent the whole of the information: mean, covariance, and weight. These parameters are estimated from training data using the iterative expectation maximization (EM) algorithm. EM is a statistical algorithm that iteratively finds locally maximum likelihood parameters of a probabilistic model when equations cannot be solved directly.

Assumptions: - Mixture of Gaussians: The underlying distribution of the entire dataset is assumed to be a combination of a finite number of Gaussian (Normal) distributions.

Gaussian Subpopulations: The data is generated from distinct “subpopulations” or clusters, and each subpopulation is modeled as a multivariate Gaussian distribution with its own mean vector , covariance matrix, and mixing weight.
Independent and Identically Distributed: Each data point is assumed to be generated independently from this mixture distribution.
Parameters are Unknown: The specific parameters—the means, covariance matrices, and the mixing proportions (how much each Gaussian contributes to the total data) are unknown and must be learned from the data, typically using the Expectation-Maximization (EM) algorithm.
Elliptical Cluster Shapes: Due to the use of covariance matrices, GMMs assume that the clusters are elliptical in shape, which allows them to model more complex, stretched, or overlapping clusters compared to K-Means.
Fixed Number of Components: The model requires the user to pre-specify the number of components to be used.
Optional Covariance Structure: Depending on the implementation (e.g., in scikit-learn), one can assume different structures for the covariance matrices, such as ‘spherical’ (equal variance in all directions), ‘diagonal’ (axes-aligned), ‘tied’ (all components share the same covariance), or ‘full’ (each component has its own general matrix).

1. Comparative Analysis of Multivariate Mixture Models for Clustering Cancer Expression Data

Research Question

Clustering gene expression data is essential for understanding tumor heterogeneity but is challenged by high dimensionality, noise, and class imbalance. We evaluated three Gaussian mixture model (GMM) based clustering methods: classical GMM, GMM with variance decomposition, and GMM with feature saliency on transcriptomic data from The Cancer Genome Atlas (TCGA) #### Data type TCGA gene expression data

Variance Components

To differentiate between these components (nformative signal and stochastic noise), we fitted a series of univariate Gaussian-mixture models by means of the expectation–maximisation (EM) algorithm, sequentially evaluating candidate models containing between 2 and 25 components. The optimal number of components for a given dataset was selected with the Bayesian Information Criterion (BIC) [17], which penalises excessive model complexity by taking into account both the number of free parameters and the effective sample size. After determining the best-fitting mixture, we identified the component with the highest mean, under the assumption that features belonging to this cluster carry the largest amount of relevant information.

Package used

Key findings

Variance decomposition performed best for the molecularly similar and balanced LUAD vs LUSC pair, capturing subtle expression differences.

This study highlights that while latent-variable mixture models are promising for transcriptomic clustering, their performance depends on data-specific factors such as imbalance and molecular similarity.

R Implementation

Method 3: Singular value decomposition

Overview

Singular Value Decomposition (SVD) is a general matrix factorization method in linear algebra that decomposes a matrix into three other matrices, providing a way to represent data in terms of its singular values. It reveals underlying structures, acting as a data reduction tool for dimensionality reduction (like PCA).

1. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis

Research Question

It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes.

Data type

Expression data (including kidney samples from normal kidney tissue, the cortex and the medulla. All expression data were analyzed on the log scale.)

Variance Components

The goal of the SVA algorithm is therefore to identify and estimate the surrogate variables

Package used

SVA

Key findings

We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.

R Implementation

The link to the package is broken

2. Analysis of region-specific changes in gene expression upon treatment with citalopram and desipramine reveals temporal dynamics in response to antidepressant drugs at the transcriptome level

Research Question

Through the experimental design of our study, we expected to highlight some unresolved questions regarding the molecular mechanisms of antidepressant action. First, what is the time-course of antidepressant action, i.e., which molecular targets are affected by a specific AD and at what time during treatment does it occur? What are the structure-specific changes in gene expression that occur during ADs treatment? Are there common end points of antidepressant action, i.e., do ADs belonging to different pharmacological classes target a final common molecular pathway? Finally, does acute treatment followed by a drug-free period of time induce similar molecular changes to those caused by repeated treatment?

Data type

RNA (expression analysed using RT-PCR)

Variance Components

mode of action of antidepressant drugs (The expression profile of any given gene is a linear combination of the obtained modes with coefficients equal to the corresponding elements of the matrix)

Package used

Key findings

Singular value decomposition analysis revealed differences in the global gene expression profiles between treatment types. The numbers of characteristic modes were generally smaller after CIT treatment than after DMI treatment. Analysis of the dynamics of gene expression revealed that the most significant changes concerned immediate early genes, whose expression was also visualized by in situ hybridization. Transcription factor binding site analysis revealed an over-representation of serum response factor binding sites in the promoters of genes that changed upon treatment with both ADs.

R Implementation

Method 4: Poisson mixed-effects model (POME model)

Overview

Poisson regression models are generalized linear models with the logarithm as the (canonical) link function, and the Poisson distribution function as the assumed probability distribution of the response. Negative binomial regression is a popular generalization of Poisson regression because it loosens the highly restrictive assumption that the variance is equal to the mean made by the Poisson model.

Assumptions: - The variance is equal to the mean.

The response variable is non-negative integer data.
The responses are independent from one another.
The responses occur over fixed time or space.

1. Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq

Research Question

RNA sequencing (RNA-Seq) is a powerful new technology for mapping and quantifying transcriptomes using ultra high-throughput next-generation sequencing technologies. Using deep sequencing, gene expression levels of all transcripts including novel ones can be quantified digitally. Although extremely promising, the massive amounts of data generated by RNA-Seq, substantial biases and uncertainty in short read alignment pose challenges for data analysis. In particular, large base-specific variation and between-base dependence make simple approaches, such as those that use averaging to normalize RNA-Seq data and quantify gene expressions, ineffective.

Data type

Microarray

Variance Components

Base-specific variation Between-base dependence

Package used

Key findings

In this study, we propose a Poisson mixed-effects (POME) model to characterize base-level read coverage within each transcript. The underlying expression level is included as a key parameter in this model. Since the proposed model is capable of incorporating base-specific variation as well as between-base dependence that affect read coverage profile throughout the transcript, it can lead to improved quantification of the true underlying expression level.

R Implementation

Method 5: Equilibrium model and a simplified ODE model

Overview

The “equilibrium model”, also known as the “single equilibrium model” or “engineering resilience”, can only be used for studying the behavior of linear systems. This is the primary and old model of resilience that relates to the capacity of bouncing back and returning to equilibrium.

Ordinary differential equation (ODE) models describe how system components change continuously over time. What makes them ordinary is that they only consider change with respect to one variable, usually time, in contrast to other types of differential equations which can describe changes across multiple variables (e.g. time and space).

1. A model of estrogen-related gene expression reveals non-linear effects in transcriptional response to tamoxifen

Research Question

Estrogen receptors alpha (ER) are implicated in many types of female cancers, and are the common target for anti-cancer therapy using selective estrogen receptor modulators (SERMs, such as tamoxifen). However, cell-type specific and patient-to-patient variability in response to SERMs (from suppression to stimulation of cancer growth), as well as frequent emergence of drug resistance, represents a serious problem. The molecular processes behind mixed effects of SERMs remain poorly understood, and this strongly motivates application of systems approaches. In this work, we aimed to establish a mathematical model of ER-dependent gene expression to explore potential mechanisms underlying the variable actions of SERMs.

Data type

Gene expression of reporter gene (ALP)

Variance Components

The amount of ER-dependent promoters (Dt), estrogen receptor level (ERt), and background hormone concentration (Ht).

To identify the key model parameters, whose variation may have the biggest impact on the transcriptional response to tamoxifen, we used global sensitivity analysis (GSA). The GSA technique allows exploration of the sensitivity of specific model outputs to simultaneous variation of multiple model parameters within broad parameter space, and has recently proven its value for the analysis of the range of biological systems

Package used

DBSolve package (Model construction and simulation)

Global sensitivity analysis made use of KINSOL solver of SUNDIALS

Key findings

Our findings suggest that transcriptional response to tamoxifen is controlled in a complex non-linear way by several key parameters, including ER expression level, hormone concentration, amount of ER-responsive genes and the capacity of ER-tamoxifen complexes to stimulate transcription (e.g. by recruiting co-regulators of transcription). The model revealed non-monotonic dependence of ER-induced transcriptional response on the expression level of ER, that was confirmed experimentally in four variants of the MCF-7 breast cancer cell line.

We established a minimal mechanistic model of ER-dependent gene expression, that predicts complex non-linear effects in transcriptional response to tamoxifen in the broad range of biologically plausible parameter values. Our findings suggest that the outcome of a SERM’s action is defined by several key components of cellular micro-environment, that may contribute to cell-type-specific effects of SERMs and justify the need for the development of combinatorial biomarkers for more accurate prediction of the efficacy of SERMs in specific cell types.

R Implementation

References

Almeida, M., Peralta, J., Garcia, J., Diego, V., Goring, H., Williams-Blangero, S., & Blangero, J. (2018). Modeling methylation data as an additional genetic variance component. BMC Proceedings, 12(9), 29. https://doi.org/10.1186/s12919-018-0128-7

Angel, P. W., Rajab, N., Deng, Y., Pacheco, C. M., Chen, T., Cao, K.-A. L., Choi, J., & Wells, C. A. (2020). A simple, scalable approach to building a cross-platform transcriptome atlas. PLOS Computational Biology, 16(9), e1008219. https://doi.org/10.1371/journal.pcbi.1008219

Busso, T., & Flück, M. (2013). A mixed-effects model of the dynamic response of muscle gene transcript expression to endurance exercise. European Journal of Applied Physiology, 113(5), 1279–1290. https://doi.org/10.1007/s00421-012-2547-x

Chatterjee, P., Stevens, H. Y., Kippner, L. E., Yeago, C., Drissi, H., Mautner, K., Boden, S., Gibson, G., & Roy, K. (2025). Single-Cell Transcriptomics of Multi-Site Cell Therapy in Osteoarthritis: Tissue-Specific Traits and Treatment Correlations (p. 2025.04.14.25325743). medRxiv. https://doi.org/10.1101/2025.04.14.25325743

Chen, J. J., Delongchamp, R. R., Tsai, C.-A., Hsueh, H., Sistare, F., Thompson, K. L., Desai, V. G., & Fuscoe, J. C. (2004). Analysis of variance components in gene expression data. Bioinformatics, 20(9), 1436–1446. https://doi.org/10.1093/bioinformatics/bth118

Fredrickson, B. L., Grewen, K. M., Algoe, S. B., Firestine, A. M., Arevalo, J. M. G., Ma, J., & Cole, S. W. (2015). Psychological Well-Being and the Human Conserved Transcriptional Response to Adversity. PLOS ONE, 10(3), e0121839. https://doi.org/10.1371/journal.pone.0121839

Gąska, M., Kuśmider, M., Solich, J., Faron-Górecka, A., Krawczyk, M. J., Kułakowski, K., & Dziedzicka-Wasylewska, M. (2012). Analysis of region-specific changes in gene expression upon treatment with citalopram and desipramine reveals temporal dynamics in response to antidepressant drugs at the transcriptome level. Psychopharmacology, 223(3), 281–297. https://doi.org/10.1007/s00213-012-2714-0

Hoffman, G. E., & Schadt, E. E. (2016). variancePartition: Interpreting drivers of variation in complex gene expression studies. BMC Bioinformatics, 17(1), 483. https://doi.org/10.1186/s12859-016-1323-z

Hu, M., Zhu, Y., Taylor, J. M. G., Liu, J. S., & Qin, Z. S. (2012). Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq. Bioinformatics (Oxford, England), 28(1), 63–68. https://doi.org/10.1093/bioinformatics/btr616

Jullian Fabres, P., & Lee, S. H. (2023). Phenotypic variance partitioning by transcriptomic gene expression levels and environmental variables for anthropometric traits using GTEx data. Genetic Epidemiology, 47(7), 465–474. https://doi.org/10.1002/gepi.22531

Lebedeva, G., Yamaguchi, A., Langdon, S. P., Macleod, K., & Harrison, D. J. (2012). A model of estrogen-related gene expression reveals non-linear effects in transcriptional response to tamoxifen. BMC Systems Biology, 6(1), 138. https://doi.org/10.1186/1752-0509-6-138

Leek, J. T., & Storey, J. D. (2007). Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLOS Genetics, 3(9), e161. https://doi.org/10.1371/journal.pgen.0030161

Suomi, T., Starskaia, I., Kalim, U. U., Rasool, O., Jaakkola, M. K., Grönroos, T., Välikangas, T., Brorsson, C., Mazzoni, G., Bruggraber, S., Overbergh, L., Dunger, D., Peakman, M., Chmura, P., Brunak, S., Schulte, A. M., Mathieu, C., Knip, M., Lahesmaa, R., … Gommers, L. (2023). Gene expression signature predicts rate of type 1 diabetes progression. eBioMedicine, 92. https://doi.org/10.1016/j.ebiom.2023.104625

Walker, J. A. (2016). Monte Carlo simulation of OLS and linear mixed model inference of phenotypic effects on gene expression. PeerJ, 4, e2575. https://doi.org/10.7717/peerj.2575

Widzisz, K., Kania, M., Zyla, J., & Polanski, A. (2025). Comparative Analysis of Multivariate Mixture Models for Clustering Cancer Expression Data. Proceedings of the 12th International Conference on Bioinformatics Research and Applications, 88–92. https://doi.org/10.1145/3774976.3774991

Zhou, X., Im, H. K., & Lee, S. H. (2020). CORE GREML for estimating covariance between random effects in linear mixed models for complex trait analyses. Nature Communications, 11(1), 4208. https://doi.org/10.1038/s41467-020-18085-5

Session Info

sessionInfo()

## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Zurich
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     xfun_0.52        
##  [5] cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.29   
##  [9] lifecycle_1.0.4   cli_3.6.5         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.3.3    rstudioapi_0.16.0 tools_4.3.3       evaluate_1.0.1   
## [17] bslib_0.8.0       yaml_2.3.10       rlang_1.1.6       jsonlite_2.0.0

Variance Decomposition Methods for Gene Expression Data

2026-01-14

Research Question

Aim:

Method 1.1: Linear Mixed Models

Overview

1. Phenotypic variance partitioning by transcriptomic gene expression levels and environmental variables for anthropometric traits using GTEx data

Research Question

Data type (Response)

Variance Components

Package used

Key findings

R Implementation

2. variancePartition: interpreting drivers of variation in complex gene expression studies

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

3. Modeling methylation data as an additional genetic variance component

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

4. Analysis of variance components in gene expression data

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

5. Gene expression signature predicts rate of type 1 diabetes progression

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

6. A mixed-effects model of the dynamic response of muscle gene transcript expression to endurance exercise

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

Method 1.2: Linear Mixed Models (LMM with precision weights)

Overview

1. Single-Cell Transcriptomics of Multi-Site Cell Therapy in Osteoarthritis: Tissue-Specific Traits and Treatment Correlations

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

Method 1.3: Univariate linear model (simple linear regression)

Overview

1. A simple, scalable approach to building a cross-platform transcriptome atlas

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

Method 1.4: linear mixed models (LMMs) using genome-based restricted maximum likelihood (GREML)

Overview

1. CORE GREML for estimating covariance between random effects in linear mixed models for complex trait analyses

Research Question

Data type

Variance Components

Package used

Key findings

R Implementation

Method 1.5: Comparison of linear model (GLS) of fixed effects/ linear model with fixed effects and correlated error, multivariate (OLS) linear model ✔️ and generalized estimating equation (GEE) models

Overview

1. Monte Carlo simulation of OLS and linear mixed model inference of phenotypic effects on gene expression

Research Question

Data type