Overview of the simLDSC function

In this tutorial, we will guide you through the steps to run the simLDSC function, which allows you to simulate Genome-Wide Association Studies (GWAS) summary data for multiple phenotypes based on Linkage Disequilibrium Score Regression (LDSC), i.e.,

where

and represents the Z statistics for phenotype k and SNP j, Nk is the sample size for phenotype k, M is the number of SNPs including in the LD file, ℓ(j) is the LD score of SNP j (that is, the sum of squared correlations between the SNP and all other SNPs), Ns is the number of overlapping individuals in the corresponding phenotypes, ρ is the phenotypic correlation within the overlapping individuals, and α represents unmeasured sources of confounding such as population stratification.

Direct generation of summary statistics allows to consider an expansive set of replications and conditions that would be computationally prohibitive to simulate using a framework in which raw phenotype data were first generated for individual genomes and then submitted to GWAS. Moreover, summary data simulated under the LDSC model has the properties needed for analysis of genetic architecture by LDSC and for meta-analysis of effect sizes across phenotypes on a per-variant basis. Importantly, like any simulation approach, our approach also lacks some nuances of real data. For example, while we simulate summary data as a function of LD scores, we do not simulate summary data directly according to linkage disequilibrium (LD) structure (this would require the expansion of the covariance matrix of Z statistics across phenotypes to include cross-SNP covariances, which would dramatically increase computational burden). Thus, the simulated summary data are not appropriate for applications that are directly based on LD structure such as clumping and pruning, identification of lead SNPs within loci, or estimation of heritability using methods such as HDL and HESS, which directly rely on LD structure (rather than simply on LD scores). Similarly, because SEs in the block jackknife are affected by LD structure, simLDSC is not appropriate for estimating absolute power. However, because LD structure of data is constant across models, simLDSC can be used to estimate the relative power across analytic models.

We introduced this method in this paper:

de La Fuente, Javier, et al. “Integrated analysis of direct and proxy genome wide association studies highlights polygenicity of Alzheimer’s disease outside of the APOE region.” PLoS Genetics 18.6 (2022): e1010208.

Running the simLDSC function

The simLDSC function requires several arguments to be provided. Let’s go through each of them and understand their purpose:

# Specify the model from which the population genetic covariance matrix is derived
mod <- "F1 =~ 1*Pheno1 + 0.50*Pheno2 + 0.50*Pheno3
F1 ~~ 0.10*F1
Pheno1 ~~ 0*Pheno1
Pheno2 ~~ 0*Pheno2
Pheno3 ~~ 0*Pheno3"

# Alternatively, provide the population genetic covariance matrix directly
Spop <- matrix(NA, 3, 3)
diag(Spop) <- c(0.10, 0.025, 0.025)
Spop[lower.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)
Spop[upper.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)
# Define matrix with sample size across phenotypes (full matrix)
N <- matrix(NA, 3, 3)
diag(N) <- c(75000, 300000, 300000)

# Specify sample overlap across phenotypes in the off-diagonal elements (no sample overlap in this example)
N[lower.tri(N, diag = FALSE)] <- 0
N[upper.tri(N, diag = FALSE)] <- 0
rPheno <- matrix(0.5, 3, 3)
diag(rPheno) <- 1
int <- c(1.02, 1.01, 1.03)
ld <- "eur_w_ld_chr/"

Now that we have a good understanding of the function’s arguments, let’s run the simLDSC function:

# Run the `simLDSC` function
simLDSC(covmat = mod, N = N, rPheno = rPheno, int = int, ld = ld)

The function will output compressed GWAS summary statistics files for the k phenotypes across the number of iterations defined. For example, using the default number of replications (i.e., r = 1), we will get 3 phenotypes x 1 replication = 3 summary statistics files. The file names for phenotypes 1-3 in replication 1 will be iter1.GWAS1.sumstats.gz, iter1.GWAS2.sumstats.gz, and iter1.GWAS3.sumstats.gz, respectively.