Citation to Preprint with Technical Details

de la Fuente J. et al. (2026). Ultra-Fast Implementation of Multivariate GWAS in Genomic SEM Using Flexible Analytic Estimation [Preprint].

The preprint provides the full mathematical derivation of the analytic estimator, benchmarking results, and empirical validation against the standard iterative estimator. This tutorial focuses on practical application of the function.


⚠️ CAVEAT EMPTOR — Please read before use

The analytic estimation option in userGWAS is currently in alpha and under active development. It has not yet undergone the full validation program planned for the beta version. Our intention in releasing this alpha version is to enable interested members of the community to work with the function early and provide feedback on ease of use, requested features, and any unexpected behavior before we advance to a more complete version suitable for broader beta testing.

What this means in practice:

  • The function’s interface, arguments, and output may change without notice in future versions. The alpha release does not guarantee stability of any aspect of the function.
  • Results should be cross-validated against the standard iterative userGWAS function for a random subset of SNPs (e.g., ~1,000 randomly selected SNPs and a sample of top hits) before drawing scientific conclusions.
  • Users should exercise particular caution in high-dimensional applications (25+ phenotypes), for which iterative cross-validation is not currently feasible. Validity studies for very high-dimensional settings are planned for future work.
  • We actively welcome feedback. If you encounter unexpected results, errors, or have suggestions for features, please open an issue on GitHub or contact the authors directly (). Your input will directly shape the development of the beta version.

Overview

The userGWAS function in GenomicSEM now includes an analytic estimation option that performs multivariate GWAS using a closed-form Generalized Least Squares (GLS) estimator for the most computationally intensive portion of model estimation, replacing the iterative numerical optimization that userGWAS previously relied on exclusively.

The analytic estimator is activated by setting analytic = TRUE in the userGWAS call. The default remains analytic = FALSE (iterative estimation) to preserve backward compatibility during the alpha period.

The core idea

Standard Genomic SEM multivariate GWAS fits a separate structural equation model for each of millions of SNPs using an iterative estimator. This is accurate but computationally demanding, typically requiring access to a high-performance computing (HPC) cluster.

The analytic option takes advantage of the fact that the model can be decomposed into two independent parts:

  1. The measurement model (factor loadings, factor covariances) — depends only on the LDSC output, is the same for every SNP, and requires optimizing a nonlinear system of equations. This is estimated once with numerical optimization using the standard usermodel() function.
  2. The structural model (SNP → factor regressions) — must be estimated separately for each SNP, but constitutes a linear system of equations for which a direct analytic GLS solution exists.

By substituting an analytic GLS estimator for the iterative estimator in step 2, run-times are reduced by over 800 times without meaningfully changing the results. Consequently, analytic estimation does not require an HPC environment — researchers can run full genome-wide analyses on standard personal computers.


Notes on model specification

The measurement model (i.e., the portion of the model that does not involve the SNP) should include only GWAS phenotypes, first-order factors, factor loadings (=~), and factor covariances (~~). Higher-order factors and regressions between factors should not be explicitly included. However, nearly all models containing such terms can be reparameterized as first-order measurement models.

The structural model (i.e., the portion of the model that involves the SNP) should include only SNP → factor regressions (~ SNP). To specify a SNP → GWAS phenotype regression directly, the measurement model should include a factor representing that phenotype, with the residual variance of the phenotype set to zero.

A key check when reparametrizing a model is to fit the no-SNP versions of both the original and reparameterized models in usermodel() and confirm that their model fit, degrees of freedom, and parameter estimates are equivalent. In future releases, we plan to provide examples of how several common models (e.g., hierarchical factor models, mediation models, GWAS-by-subtraction models) can be parametrized to meet these requirements.


Computational gains

**Runtime comparison for ~20K SNPs: iterative vs. analytic estimation. Data from the preprint.**

Runtime comparison for ~20K SNPs: iterative vs. analytic estimation. Data from the preprint.


Installation

Install GenomicSEM directly from GitHub:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("GenomicSEM/GenomicSEM")

Using analytic estimation in userGWAS

Analytic estimation is activated by adding analytic = TRUE to your existing userGWAS call. All other arguments remain the same as in standard iterative usage.

userGWAS(
  covstruc   = LDSCoutput,   # ldsc() output
  SNPs       = sumstats,     # sumstats() output
  model      = model,        # lavaan-style syntax (including SNP lines)
  usermod    = NULL,         # pre-fitted no-SNP model output. Default is NULL.
  analytic   = TRUE,         # activates the analytic estimator
  batch_size = 100000        # number of SNPs processed per batch (tune to available RAM)
)
New argument Description
analytic Logical. If TRUE, uses the closed-form GLS analytic estimator. Default is FALSE (iterative).
batch_size Integer. Number of SNPs per batch. Larger values are faster but require more RAM. Default is 100,000.
usermod Optional. Pre-fitted no-SNP model output (see below). Default is NULL.

All other userGWAS arguments are unchanged. See ?userGWAS for the full argument list.

The usermod argument

By default, usermod = NULL and the function internally fits the no-SNP measurement model from the lavaan syntax provided in model. If you have already fitted the no-SNP model separately — for example, to inspect model fit or validate the factor structure before running the GWAS — you can pass the $results element returned by usermodel() directly via usermod. The function will then skip internal estimation of the no-SNP model and use the supplied parameter estimates to extract factor loadings.

When usermod is provided, there is no need to include the model syntax, as the function estimates SNP effects on all factors detected in the provided model output.

# Step 1: fit the no-SNP model separately
nosnp_fit <- usermodel(
  LDSCoutput,
  estimation = "DWLS",
  model      = nosnp_model
)

# Step 2: pass the results directly
userGWAS(
  sumstats   = sumstats,
  LDSCoutput = LDSCoutput,
  usermod    = nosnp_fit$results,
  batch_size = 100000
)

Note: The behavior of usermod — specifically, which factors receive a SNP effect when it is provided — may be refined in future versions of the function.


Output columns

When analytic = TRUE, userGWAS returns a single data frame with one row per SNP (rather than a list of data frames as in iterative mode). The following columns are returned for each factor F specified in the ~ SNP lines of your model:

Column Description
beta_F Estimated effect of the SNP on factor F
SE_F Standard error of beta_F
Z_beta_F Z-statistic for beta_F
p_val_F Two-sided p-value for beta_F
Q_omnibus Omnibus heterogeneity statistic across all traits
Q_omnibus_df Degrees of freedom for Q_omnibus (= kf, where k is the number of phenotypes and f the number of factors)
Q_omnibus_pval p-value for Q_omnibus

Worked example: 5-factor psychiatric GWAS

This example reproduces the analysis from Grotzinger et al. (2023) using a 5-factor model of common psychiatric disorders across 13 phenotypes.

Data

The two files required to run this example can be downloaded below. Place both files in your R working directory before running the code.

  • LDSC_PSYCH.RData — Pre-computed LDSC output containing the genetic covariance matrix and sampling covariance matrix for the 5-factor psychiatric model.
  • Psych_sumstats.txt — Merged GWAS summary statistics for 13 psychiatric phenotypes, formatted for userGWAS input.
library(data.table)
library(GenomicSEM)

load("LDSC_PSYCH.RData")       # loads LDSC_P
sumstats <- fread("Psych_sumstats.txt", data.table = FALSE)

Model specification

The model follows standard lavaan syntax. Lines ending in ~ SNP define the factors that will receive a SNP effect estimate. All other lines define the factor structure and are estimated once from the no-SNP model.

Reminder: The measurement model should include only GWAS phenotypes, first-order factors, factor loadings, and factor covariances. The structural model should include only SNP → factor regressions.

model <- '
  Comp  =~  OCD + TS + AN
  Psych =~  a*SCZ + a*BIP
  Neuro =~  ADHD + MDD + TS + ASD + PTSD
  Int   =~  MDD + ANX + PTSD
  SUD   =~  ALCH + CUD + OUD
  Comp  ~~  Psych + Neuro + Int + SUD
  Psych ~~  Neuro + Int + SUD
  Neuro ~~  Int + SUD
  Int   ~~  SUD

  Comp  ~   SNP
  Psych ~   SNP
  Neuro ~   SNP
  Int   ~   SNP
  SUD   ~   SNP
'

Note that a*SCZ + a*BIP imposes an equality constraint on the loadings of SCZ and BIP onto the Psychotic Disorders factor. The analytic estimator correctly handles constrained model syntax of this form.

Running the analysis

results <- userGWAS(
  covstruc   = LDSC_P,
  SNPs       = sumstats,
  model      = model,
  analytic   = TRUE,
  batch_size = 100000
)

Inspecting the output

The output is a single data frame with one row per SNP. The table below shows the first six rows from a real run of the 5-factor psychiatric GWAS model.

Table 1. First six rows of userGWAS(analytic = TRUE) output from the 5-factor psychiatric GWAS model (13 phenotypes, real data).
Compulsive (Comp)
Neurodevelopmental (Neuro)
Internalizing (Int)
Substance Use (SUD)
Psychotic (Psych)
Heterogeneity
SNP CHR BP MAF A1 A2 beta_Comp SE_Comp Z_beta_Comp p_val_Comp beta_Neuro SE_Neuro Z_beta_Neuro p_val_Neuro beta_Int SE_Int Z_beta_Int p_val_Int beta_SUD SE_SUD Z_beta_SUD p_val_SUD beta_Psych SE_Psych Z_beta_Psych p_val_Psych Q_omnibus Q_omnibus_df Q_omnibus_pval
rs12089192 1 1171022 0.1183 G C -0.0149 0.0179 -0.8299 0.4066 -0.0002 0.0119 -0.0128 0.9898 -0.0072 0.0035 -2.0667 0.0388 0.0162 0.0115 1.4040 0.1603 -0.0121 0.0062 -1.9387 0.0525 8.680 8 0.3700
rs78446752 1 1183397 0.0666 C T -0.0093 0.0235 -0.3952 0.6927 0.0131 0.0152 0.8643 0.3874 0.0001 0.0043 0.0325 0.9741 0.0144 0.0143 1.0059 0.3145 -0.0228 0.0080 -2.8315 0.0046 6.186 8 0.6264
rs2494426 1 2337537 0.1203 C G -0.0156 0.0176 -0.8837 0.3769 0.0082 0.0124 0.6602 0.5091 0.0100 0.0037 2.6936 0.0071 -0.0028 0.0114 -0.2414 0.8092 -0.0039 0.0067 -0.5869 0.5573 6.487 8 0.5929
rs12727713 1 3095631 0.0457 G A 0.0061 0.0277 0.2200 0.8259 0.0327 0.0202 1.6164 0.1060 -0.0068 0.0047 -1.4479 0.1476 -0.0007 0.0177 -0.0391 0.9688 -0.0120 0.0091 -1.3238 0.1856 6.082 8 0.6380
rs2455103 1 3178416 0.1421 A G -0.0061 0.0150 -0.4054 0.6852 -0.0032 0.0106 -0.3026 0.7622 0.0029 0.0033 0.8846 0.3764 -0.0059 0.0104 -0.5691 0.5693 -0.0130 0.0057 -2.2629 0.0236 15.747 8 0.0461
rs12097610 1 3812661 0.1829 C T -0.0082 0.0138 -0.5975 0.5501 0.0022 0.0098 0.2263 0.8210 -0.0028 0.0031 -0.9051 0.3654 -0.0087 0.0092 -0.9491 0.3426 0.0061 0.0053 1.1678 0.2429 4.580 8 0.8014

Providing feedback

We actively welcome feedback from users. If you:

  • encounter unexpected results or errors,
  • have suggestions for new features, or
  • would like to report a use case where the function behaved unexpectedly,

please open an issue on the GenomicSEM GitHub repository or contact the authors directly (). Your input will directly shape the development of the beta version.