Overview of the `simLDSC` function

In this tutorial, we will guide you through the steps to run the simLDSC function, which allows you to simulate Genome-Wide Association Studies (GWAS) summary data for multiple phenotypes based on Linkage Disequilibrium Score Regression (LDSC), i.e.,

a title where

a title

and a title represents the Z statistics for phenotype k and SNP j, Nk is the sample size for phenotype k, M is the number of SNPs including in the LD file, ℓ(j) is the LD score of SNP j (that is, the sum of squared correlations between the SNP and all other SNPs), Ns is the number of overlapping individuals in the corresponding phenotypes, ρ is the phenotypic correlation within the overlapping individuals, and α represents unmeasured sources of confounding such as population stratification.

Direct generation of summary statistics allows to consider an expansive set of replications and conditions that would be computationally prohibitive to simulate using a framework in which raw phenotype data were first generated for individual genomes and then submitted to GWAS. Moreover, summary data simulated under the LDSC model has the properties needed for analysis of genetic architecture by LDSC and for meta-analysis of effect sizes across phenotypes on a per-variant basis. Importantly, like any simulation approach, our approach also lacks some nuances of real data. For example, while we simulate summary data as a function of LD scores, we do not simulate summary data directly according to linkage disequilibrium (LD) structure (this would require the expansion of the covariance matrix of Z statistics across phenotypes to include cross-SNP covariances, which would dramatically increase computational burden). Thus, the simulated summary data are not appropriate for applications that are directly based on LD structure such as clumping and pruning, identification of lead SNPs within loci, or estimation of heritability using methods such as HDL and HESS, which directly rely on LD structure (rather than simply on LD scores). Similarly, because SEs in the block jackknife are affected by LD structure, simLDSC is not appropriate for estimating absolute power. However, because LD structure of data is constant across models, simLDSC can be used to estimate the relative power across analytic models.

We introduced this method in this paper:

de La Fuente, Javier, et al. “Integrated analysis of direct and proxy genome wide association studies highlights polygenicity of Alzheimer’s disease outside of the APOE region.” PLoS Genetics 18.6 (2022): e1010208.

Running the `simLDSC` function

The simLDSC function requires several arguments to be provided. Let’s go through each of them and understand their purpose:

covmat
The population genetic covariance matrix (Σ) contains SNP-based heritabilities (h2) for the k phenotypes on the diagonal elements and SNP-based coheritabilities (genetic covariances) on the off-diagonal elements. This matrix can be provided directly as a matrix or data.frame object or derived from a fully parameterized model specified in lavaan syntax.

# Specify the model from which the population genetic covariance matrix is derived
mod <- "F1 =~ 1*Pheno1 + 0.50*Pheno2 + 0.50*Pheno3
F1 ~~ 0.10*F1
Pheno1 ~~ 0*Pheno1
Pheno2 ~~ 0*Pheno2
Pheno3 ~~ 0*Pheno3"

# Alternatively, provide the population genetic covariance matrix directly
Spop <- matrix(NA, 3, 3)
diag(Spop) <- c(0.10, 0.025, 0.025)
Spop[lower.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)
Spop[upper.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)

N
The sample size matrix (N) across phenotypes. It can be provided as a matrix or data.frame object with sample sizes in the diagonal elements and sample overlap between pairs of phenotypes in the off-diagonal elements.

# Define matrix with sample size across phenotypes (full matrix)
N <- matrix(NA, 3, 3)
diag(N) <- c(75000, 300000, 300000)

# Specify sample overlap across phenotypes in the off-diagonal elements (no sample overlap in this example)
N[lower.tri(N, diag = FALSE)] <- 0
N[upper.tri(N, diag = FALSE)] <- 0

rPheno
The phenotypic correlation matrix (rPheno) across the k phenotypes. It can be either a matrix or data.frame object with equal phenotypic correlations or a single numeric value ranging from 0 to 1 to be used to fill a correlation matrix with equal phenotypic correlations across phenotypes. The default is the population genetic correlation matrix.

rPheno <- matrix(0.5, 3, 3)
diag(rPheno) <- 1

int
A numeric vector of length equal to the number of phenotypes (k) containing the LDSC intercepts across phenotypes. It can also be provided as a single numeric value to assume the same intercept across all phenotypes. Setting LDSC intercept values of 1 assumes no genetic confounding, while values above 1 introduce confounding in the simulated summary statistics (e.g., population stratification).

int <- c(1.02, 1.01, 1.03)

ld
The folder containing LD scores across SNPs used as the independent variable in LDSC. The folder should be located in the working directory. You can use the default LD scores provided in the “eur_w_ld_chr” folder, or supply your own if desired.

ld <- "eur_w_ld_chr/"

N_overlap

The proportion of sample overlap between samples (off-diagonal elements). This argument is only effective if N is input as a single integer value. The default value is 0.99, representing a situation in which all data come from a single sample with only minimal missing data.
r

The number of replications for the simulation. The default value is 1.
seed

The seed to be used for random number generation. The default is seed <- 1234
gzip_output

A logical value indicating whether the simulated GWAS summary statistics should be compressed in .gz format. The default value is TRUE.
parallel

A logical value indicating whether the function should be run in parallel. The default value is FALSE.
cores

When running in parallel, this argument determines the number of cores to be used. If parallel is set to TRUE, the default is to use either 1 less than the total number of available cores or use as many cores as there are traits if the number of available cores exceeds the number of traits.

Now that we have a good understanding of the function’s arguments, let’s run the simLDSC function:

# Run the `simLDSC` function
simLDSC(covmat = mod, N = N, rPheno = rPheno, int = int, ld = ld)

The function will output compressed GWAS summary statistics files for the k phenotypes across the number of iterations defined. For example, using the default number of replications (i.e., r = 1), we will get 3 phenotypes x 1 replication = 3 summary statistics files. The file names for phenotypes 1-3 in replication 1 will be iter1.GWAS1.sumstats.gz, iter1.GWAS2.sumstats.gz, and iter1.GWAS3.sumstats.gz, respectively.

simLDSC Function Tutorial

Javier de la Fuente & Elliot M. Tucker-Drob

Oct 31 2023

Overview of the `simLDSC` function

Running the `simLDSC` function

`covmat`

`N`

`rPheno`

`int`

`ld`

`N_overlap`

`r`

`seed`

`gzip_output`

`parallel`

`cores`

simLDSC Function Tutorial

Javier de la Fuente & Elliot M. Tucker-Drob

Oct 31 2023

Overview of the simLDSC function

Running the simLDSC function

covmat

N

rPheno

int

ld

N_overlap

r

seed

gzip_output

parallel

cores

Overview of the `simLDSC` function

Running the `simLDSC` function

`covmat`

`N`

`rPheno`

`int`

`ld`

`N_overlap`

`r`

`seed`

`gzip_output`

`parallel`

`cores`