In this tutorial, we will guide you through the steps to run the rgmodel function. This function estimates the matrix of genetic correlations (R) and the corresponding matrix of their sampling covariances (V_R) from the output of the ldsc() function.

Understanding Standardized Genetic Covariance Matrix (S_Stand) vs. Genetic Correlations (R)

The S_Stand matrix and its associated sampling covariance matrix (V_Stand) are output by ldsc() when stand = TRUE is specified (default = FALSE). S_Stand and V_Stand are versions of the genetic covariance matrix (S) and its sampling covariance matrix (V) that have both been standardized relative to the heritabilities on the diagonal of S. The S_Stand matrix is typically more interpretable than the S matrix, and has the often desirable property of keeping the Z statistics and p-values unchanged relative to those associated with the corresponding elements of S.
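The rescaling described above can be sketched in base R. This is a minimal illustration with made-up numbers, not the ldsc() source code: it assumes that each element of S is divided by the square root of the product of the two relevant heritabilities, and that V is rescaled by the same factors. Under that assumption, the Z statistics come out identical before and after standardization.

```r
# Toy genetic covariance matrix S for three traits (made-up values)
S <- matrix(c(0.20, 0.06, 0.03,
              0.06, 0.30, 0.09,
              0.03, 0.09, 0.10), nrow = 3, byrow = TRUE)

# Toy sampling covariance matrix V for the 6 lower-triangular elements of S
# (column-major order: S11, S21, S31, S22, S32, S33); diagonal for simplicity
V <- diag(c(4e-4, 2e-4, 1e-4, 9e-4, 3e-4, 1e-4))

# Scaling factor for each element: 1 / sqrt(h2_i * h2_j)
ratio   <- tcrossprod(1 / sqrt(diag(S)))
S_Stand <- S * ratio

# Rescale V by the same factors (assumed rescaling, see lead-in)
idx       <- lower.tri(S, diag = TRUE)
scale_vec <- ratio[idx]
V_Stand   <- diag(scale_vec) %*% V %*% diag(scale_vec)

# Z statistics before and after standardization
z_S     <- S[idx] / sqrt(diag(V))
z_Stand <- S_Stand[idx] / sqrt(diag(V_Stand))
print(round(cbind(z_S, z_Stand), 3))
```

Because the estimate and its standard error are multiplied by the same factor, the two columns printed at the end match.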

We refer to S_Stand as the standardized genetic covariance matrix and treat it as conceptually distinct from the genetic correlation matrix. The key rationale for making this distinction is that the sampling covariance matrix of S_Stand (i.e. V_Stand) is simply a rescaled version of V, and does not correspond to the sampling covariance matrix that the genetic correlation matrix would have had it been directly estimated across independent samples from the same population (or across jackknife blocks). This is apparent in that V_Stand contains sampling variances (squared standard errors) of the standardized heritabilities. The standardized heritabilities are always, by definition, 1.0, but V_Stand allows them to have nonzero standard errors to express uncertainty in the heritability estimates, rescaled to a standardized metric. In contrast, a true genetic correlation matrix should not have standard errors associated with its diagonals, which are guaranteed to be 1.0 across independent samples and jackknife blocks. The off-diagonals of the sampling covariance matrices of the standardized genetic covariance matrix and the genetic correlation matrix are also expected to differ, albeit for less intuitive reasons.
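A small simulation may help make this distinction concrete. The sketch below uses made-up values and, for simplicity, draws the heritability and covariance estimates independently (real estimates would be correlated). It contrasts (a) rescaling replicate estimates by fixed point estimates, as in a standardized covariance metric, with (b) recomputing the correlation within each replicate.

```r
set.seed(1)
n_rep <- 2000

# Hypothetical replicate estimates of two heritabilities and their covariance
h2_1  <- rnorm(n_rep, mean = 0.20, sd = 0.02)
h2_2  <- rnorm(n_rep, mean = 0.30, sd = 0.03)
cov12 <- rnorm(n_rep, mean = 0.10, sd = 0.015)

# (a) Rescaled metric: divide every replicate by the same fixed point estimates
stand_h2_1  <- h2_1 / 0.20
stand_cov12 <- cov12 / sqrt(0.20 * 0.30)

# (b) Correlation metric: standardize within each replicate
rg_diag_1 <- h2_1 / h2_1                 # diagonal element, always 1
rg_12     <- cov12 / sqrt(h2_1 * h2_2)   # off-diagonal element

# The rescaled "heritability" keeps a nonzero SE; the correlation diagonal does not
print(c(sd_stand_diag = sd(stand_h2_1), sd_rg_diag = sd(rg_diag_1)))

# The off-diagonal SEs also differ between the two metrics
print(c(sd_stand_cov = sd(stand_cov12), sd_rg = sd(rg_12)))
```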

The rgmodel() function provides an automated means of estimating a genetic correlation matrix (R) and its sampling covariance matrix (V_R) by fitting a genetic correlation model directly to ldsc() output. This model automatically specifies a standardized genetic factor (with variance = 1.0) for each phenotype, sets the residual genetic variance of each GWAS phenotype to 0, freely estimates all factor loadings, and allows all factors to freely covary. Because the covariances of factors with fixed variances of 1.0 are equivalent to correlations, this model provides direct estimates of genetic correlations, and the sampling covariances of the model parameters include the correct sampling covariances of the genetic correlation matrix, which the function uses to construct V_R.
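To make the model specification above concrete, the following sketch builds lavaan-style syntax for a three-trait genetic correlation model. The trait names, factor names, and the helper code are hypothetical and only illustrate the four constraints described above; this is not necessarily the exact syntax that rgmodel() generates internally.

```r
traits  <- c("X1", "X2", "X3")               # hypothetical trait names
factors <- paste0("F", seq_along(traits))    # one latent factor per trait

loadings  <- paste0(factors, " =~ NA*", traits)   # loading freely estimated
fac_var   <- paste0(factors, " ~~ 1*", factors)   # factor variance fixed to 1
resid_var <- paste0(traits, " ~~ 0*", traits)     # residual variance fixed to 0

# All pairwise factor covariances are free; with unit variances they are correlations
pairs   <- combn(factors, 2)
fac_cov <- paste0(pairs[1, ], " ~~ ", pairs[2, ])

rg_syntax <- paste(c(loadings, fac_var, resid_var, fac_cov), collapse = "\n")
cat(rg_syntax, "\n")
```

With k traits, this specification has k loadings and k(k-1)/2 factor correlations, which matches the k(k+1)/2 unique elements of S, so the model is fully saturated (df = 0).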

Running the rgmodel Function

The rgmodel function requires only the LDSCoutput argument, which is an object containing the output from the Genomic SEM multivariable LD Score regression ldsc() function. For the examples showcased in this tutorial, we will use the ldsc() output for two scenarios: one with twenty-four traits (X1–X24) having low-to-moderate sample overlap, and another with thirteen traits (X1–X13) having moderate-to-high sample overlap.

You can download the ldsc() output for these examples from the following links:

Let’s load the ldsc() output and run the rgmodel() function:

# Example 1: Low-to-moderate sample overlap
load("rgmodel_LDSC_ex1.RData") # Load LDSC output
LDSCoutputRG1 <- rgmodel(LDSCoutput = LDSCoutput_ex1) # Run rgmodel function
## [1] "Running primary model"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed 
##    20.5 
## [1] "Model fit statistics are all printed as NA as you have specified a fully saturated model (i.e., df = 0)"
## [1] "The S matrix was smoothed prior to model estimation due to a non-positive definite matrix. The largest absolute difference in a cell between the smoothed and non-smoothed matrix was  0.00672444591817246 As a result of the smoothing, the largest Z-statistic change for the genetic covariances was  0.34108468844048 . We recommend setting the smooth_check argument to true if you are going to run a multivariate GWAS."
# Example 2: Moderate-to-high sample overlap
load("rgmodel_LDSC_ex2.RData") # Load LDSC output
LDSCoutputRG2 <- rgmodel(LDSCoutput = LDSCoutput_ex2) # Run rgmodel function
## [1] "Running primary model"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed 
##    1.69 
## [1] "Model fit statistics are all printed as NA as you have specified a fully saturated model (i.e., df = 0)"
## [1] "The S matrix was smoothed prior to model estimation due to a non-positive definite matrix. The largest absolute difference in a cell between the smoothed and non-smoothed matrix was  0.00541591900985086 As a result of the smoothing, the largest Z-statistic change for the genetic covariances was  0.213387672883552 . We recommend setting the smooth_check argument to true if you are going to run a multivariate GWAS."

Output Objects of rgmodel()

The rgmodel() function creates a copy of the original ldsc object using the specified name (here LDSCoutputRG1 and LDSCoutputRG2) with R and V_R added. Thus, the new LDSC object we have created includes the original ldsc() output along with R and V_R.

Correspondence between LDSC-Derived Standardized Genetic Covariance Matrix (S_Stand) and Genetic Correlations (R)

In this section, we examine the relationship between genetic correlations (R) estimated using the rgmodel() function and standardized genetic covariances (S_Stand) derived from the ldsc() function. We use two distinct examples to illustrate this correspondence, focusing on how sample overlap affects the comparison.

Example 1: Low-to-Moderate Sample Overlap

For this example, we analyze twenty-four traits (X1–X24) with low-to-moderate sample overlap. In the figure below, the solid blue line represents the regression of R on S_Stand with a slope \(b = 0.978\) (SE = 0.002), while the red dashed line represents the function \(y = x\) (i.e. a regression fixed to have intercept = 0 and slope = 1). We can observe close correspondence between R and S_Stand with a correlation coefficient \(r = 0.999\) and scatter closely centered around the dashed red y=x line. Note that we would typically expect the R and S_Stand matrices to be identical. However, when the S matrix is non-positive definite (e.g. when it includes heritabilities greater than 1.0 or genetic correlations outside the interval [-1, 1]), it is smoothed to the nearest positive definite matrix prior to fitting the genetic correlation model, which introduces small differences between R and S_Stand.
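The smoothing step mentioned above can be illustrated with a toy matrix. The sketch below assumes Matrix::nearPD() as the smoothing routine (the tutorial does not state which routine rgmodel() calls) and uses made-up values that imply a genetic correlation above 1.

```r
library(Matrix)

# Toy S with an implied genetic correlation of 0.26 / sqrt(0.20 * 0.30) > 1
S <- matrix(c(0.20, 0.26,
              0.26, 0.30), nrow = 2, byrow = TRUE)

# A negative eigenvalue shows that S is not positive definite
min_eig_before <- min(eigen(S)$values)

# Smooth to the nearest positive definite matrix (covariance, not correlation, metric)
S_smooth <- as.matrix(nearPD(S, corr = FALSE)$mat)
min_eig_after <- min(eigen(S_smooth)$values)

# Largest absolute change in any cell, analogous to the value reported in the log
max_diff <- max(abs(S_smooth - S))

print(c(min_eig_before = min_eig_before,
        min_eig_after  = min_eig_after,
        max_diff       = max_diff))
```

Any correlations derived from the smoothed matrix will differ slightly from those derived from the unsmoothed one, which is one reason R and S_Stand need not match exactly.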

Comparison of R and S_Stand for low-to-moderate sample overlap
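A comparison of this kind can be computed with a few lines of base R. The helper below, compare_offdiag(), is a hypothetical function written for this illustration (it is not part of Genomic SEM), and the matrices are toy values. It regresses the off-diagonal elements of one matrix on those of another and reports the slope, its standard error, and the correlation.

```r
# Hypothetical helper: regress off-diagonals of y_mat on off-diagonals of x_mat
compare_offdiag <- function(x_mat, y_mat) {
  x <- x_mat[lower.tri(x_mat)]
  y <- y_mat[lower.tri(y_mat)]
  coefs <- summary(lm(y ~ x))$coefficients
  c(slope = coefs["x", "Estimate"],
    se    = coefs["x", "Std. Error"],
    r     = cor(x, y))
}

# Build a symmetric 4 x 4 matrix with unit diagonal from 6 off-diagonal values
fill_sym <- function(v) {
  M <- diag(4)
  M[lower.tri(M)] <- v
  M[upper.tri(M)] <- t(M)[upper.tri(M)]
  M
}

# Toy values: R is a slightly shrunken, slightly noisy version of S_Stand
s_vals <- c(0.10, 0.25, -0.30, 0.55, 0.70, 0.40)
r_vals <- 0.98 * s_vals + c(0.004, -0.003, 0.002, -0.005, 0.003, -0.001)

S_Stand_toy <- fill_sym(s_vals)
R_toy       <- fill_sym(r_vals)

res <- compare_offdiag(S_Stand_toy, R_toy)
print(round(res, 3))
```

With real output, the same helper could be applied to the S_Stand and R elements of the object returned by rgmodel(), provided ldsc() was run with stand = TRUE so that S_Stand is present.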

Further analysis of standard errors and Z statistics associated with R and S_Stand reveals:

  • Standard Errors: The plot of standard errors indicates that S_Stand tends to exhibit larger standard errors in the presence of higher cross-trait LDSC intercepts (i.e. greater sample overlap and phenotypic correlation) compared to R. Note that the red dashed line represents y=x and the blue line is the least squares regression line.

Standard Errors for low-to-moderate sample overlap

  • Z Statistics: The Z statistics plot illustrates that the estimates in S_Stand are less significant (smaller Z statistics) for pairs of traits with higher cross-trait LDSC intercepts (i.e. greater sample overlap and phenotypic correlation). Note that the red dashed line represents y=x and the blue line is the least squares regression line.

Z Statistics for low-to-moderate sample overlap

Summary: The increased standard errors for S_Stand in the presence of greater sample overlap suggest that genetic correlations (R) estimated with the rgmodel() function are generally more precise.

Example 2: Moderate-to-High Sample Overlap

In the context of the second example, which exhibits moderate-to-high sample overlap, we again observe a close correspondence between R and S_Stand with \(r = 0.999\) and scatter closely centered around the dashed red y=x line.

Comparison of R and S_Stand for moderate-to-high sample overlap

However, the increased sample overlap across trait pairs results in noticeable discrepancies between the standard errors and Z statistics for R and S_Stand:

  • Standard Errors: The standard errors for S_Stand are considerably larger compared to those for R, highlighting the reduced precision of S_Stand estimates under high sample overlap. Note that the red dashed line represents y=x and the blue line is the least squares regression line.

Standard Errors for moderate-to-high sample overlap

  • Z Statistics: The Z statistics plot demonstrates that, in the context of higher cross-trait LDSC intercepts (i.e. greater sample overlap and phenotypic correlation), R estimates have greater statistical power compared to S_Stand. Note that the red dashed line represents y=x and the blue line is the least squares regression line.

Z Statistics for moderate-to-high sample overlap

Summary: As with the low-to-moderate sample overlap scenario, the presence of moderate-to-high sample overlap leads to larger standard errors for S_Stand compared to R for traits with higher cross-trait intercepts (more highly correlated estimation errors). Thus, genetic correlations estimated using the rgmodel() function generally offer greater precision and power.