simLDSC functionIn this tutorial, we will guide you through the steps to run the
simLDSC function, which allows you to simulate Genome-Wide
Association Studies (GWAS) summary data for multiple phenotypes based on
Linkage Disequilibrium Score Regression (LDSC), i.e.,
where
and represents the Z statistics
for phenotype k and SNP j, Nk is the sample size for phenotype k, M is
the number of SNPs including in the LD file, ℓ(j) is the LD score of SNP
j (that is, the sum of squared correlations between the SNP and all
other SNPs), Ns is the number of overlapping individuals in the
corresponding phenotypes, ρ is the phenotypic correlation within the
overlapping individuals, and α represents unmeasured sources of
confounding such as population stratification.
Direct generation of summary statistics allows to consider an expansive set of replications and conditions that would be computationally prohibitive to simulate using a framework in which raw phenotype data were first generated for individual genomes and then submitted to GWAS. Moreover, summary data simulated under the LDSC model has the properties needed for analysis of genetic architecture by LDSC and for meta-analysis of effect sizes across phenotypes on a per-variant basis. Importantly, like any simulation approach, our approach also lacks some nuances of real data. For example, while we simulate summary data as a function of LD scores, we do not simulate summary data directly according to linkage disequilibrium (LD) structure (this would require the expansion of the covariance matrix of Z statistics across phenotypes to include cross-SNP covariances, which would dramatically increase computational burden). Thus, the simulated summary data are not appropriate for applications that are directly based on LD structure such as clumping and pruning, identification of lead SNPs within loci, or estimation of heritability using methods such as HDL and HESS, which directly rely on LD structure (rather than simply on LD scores). Similarly, because SEs in the block jackknife are affected by LD structure, simLDSC is not appropriate for estimating absolute power. However, because LD structure of data is constant across models, simLDSC can be used to estimate the relative power across analytic models.
We introduced this method in this paper:
simLDSC functionThe simLDSC function requires several arguments
to be provided. Let’s go through each of them and understand
their purpose:
covmatΣ) contains
SNP-based heritabilities (h2) for the k phenotypes on the
diagonal elements and SNP-based coheritabilities (genetic covariances)
on the off-diagonal elements. This matrix can be provided directly as a
matrix or data.frame object or derived from a fully parameterized model
specified in lavaan syntax.# Specify the model from which the population genetic covariance matrix is derived
mod <- "F1 =~ 1*Pheno1 + 0.50*Pheno2 + 0.50*Pheno3
F1 ~~ 0.10*F1
Pheno1 ~~ 0*Pheno1
Pheno2 ~~ 0*Pheno2
Pheno3 ~~ 0*Pheno3"
# Alternatively, provide the population genetic covariance matrix directly
Spop <- matrix(NA, 3, 3)
diag(Spop) <- c(0.10, 0.025, 0.025)
Spop[lower.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)
Spop[upper.tri(Spop, diag = FALSE)] <- c(0.05, 0.05, 0.025)
N# Define matrix with sample size across phenotypes (full matrix)
N <- matrix(NA, 3, 3)
diag(N) <- c(75000, 300000, 300000)
# Specify sample overlap across phenotypes in the off-diagonal elements (no sample overlap in this example)
N[lower.tri(N, diag = FALSE)] <- 0
N[upper.tri(N, diag = FALSE)] <- 0
rPhenorPheno <- matrix(0.5, 3, 3)
diag(rPheno) <- 1
intint <- c(1.02, 1.01, 1.03)
ldld <- "eur_w_ld_chr/"
N_overlapThe proportion of sample overlap between samples (off-diagonal elements). This argument is only effective if N is input as a single integer value. The default value is 0.99, representing a situation in which all data come from a single sample with only minimal missing data.
rThe number of replications for the simulation. The default value is 1.
seedThe seed to be used for random number generation. The default is
seed <- 1234
gzip_outputA logical value indicating whether the simulated GWAS summary
statistics should be compressed in .gz format. The default value is
TRUE.
parallelA logical value indicating whether the function should be run in
parallel. The default value is FALSE.
coresWhen running in parallel, this argument determines the number of
cores to be used. If parallel is set to TRUE, the default
is to use either 1 less than the total number of available cores or use
as many cores as there are traits if the number of available cores
exceeds the number of traits.
Now that we have a good understanding of the function’s arguments,
let’s run the simLDSC function:
# Run the `simLDSC` function
simLDSC(covmat = mod, N = N, rPheno = rPheno, int = int, ld = ld)
The function will output compressed GWAS summary statistics files for
the k phenotypes across the number of iterations defined. For example,
using the default number of replications (i.e., r = 1), we will get 3
phenotypes x 1 replication = 3 summary statistics files. The file names
for phenotypes 1-3 in replication 1 will be
iter1.GWAS1.sumstats.gz,
iter1.GWAS2.sumstats.gz, and
iter1.GWAS3.sumstats.gz, respectively.