Garrett W. Ennis and Elliot M. Tucker-Drob

Jan 28 2025

Overview of the indexS(), subSV(), summaryGLS() and summaryGLSbands() functions

Here we provide documentation for the indexs(), subSV(), summaryGLS(), and summaryGLSbands() functions created to support the GTACCC analyses reported in Ennis et al. (“Genomic Taxometric Analysis of Continuous and Case-Control”). These functions are used to index the elements in a correlation (R) or covariance (S or S_Stand) matrix, subset the matrix and its associated sampling covariance matrix (V) to the indexed elements, and perform generalized least squared (GLS) regression using a vector of observations (e.g. the elements from the subset S matrix) and its associated matrix of sampling covariances (e.g. the elements from the associated subset V matrix).

SummaryGLS() applies generalized least squares (GLS) regression, where the dependent variable is a vector of observations with a known matrix of sampling covariances. By employing a sampling covariance matrix with potentially unequal diagonal elements and nonzero off-diagonal elements, GLS relaxes the traditional assumption (e.g. of ordinary least squares) that the observations are independent and identically distributed. When the off-diagonal elements of the sampling covariance matrix are zero, GLS is equivalent to weighted least squares regression. When the off-diagonal elements are zero and diagonal elements are constant, GLS is equivalent to ordinary least squares regression.

The indexS() function is used to index the elements in a correlation matrix (R) or covariance (S or S_Stand) matrix, and the SubSV() function can use these index values to subset that matrix and its associated sampling covariance matrix (V, V_Stand, or V_R) to the elements of interest. These elements can then be input into the summaryGLS() function for modelling covariances or correlations as dependent variables. When the goal is to model genetic correlations or genetic covariances that have been estimated using LD Score Regression, these functions can be applied directly to an LDSC object (output from the ldsc() function within Genomic SEM) containing the relevant matrices.

SummaryGLSbands() allows the user to create a scatterplot with a linear or quadratic trend line estimated using GLS with appropriate error bands for a single predictor variable, and includes an option to include a set of control variables. We provide further details for each function below, accompanied by examples illustrating some of the analyses conducted in Ennis et al.

indexS()

IndexS() is a function that returns a matrix of index values for each free parameter from a covariance or correlation matrix. The function requires either an LDSC object (LDSC_OBJECT = ), or a matrix (MATRIX = ; for modeling genetic covariances, this would be the S matrix from the ldsc() function). The matrix of index values that is returned by this function can then subset to the cells of interest, using standard R syntax, such as [ , ] and diag( ).

The following arguments are required:

LDSC_OBJECT or MATRIX: This function requires either an LDSC object set with (LDSC_OBJECT = ) from which the S will be extracted, or an S matrix set with (MATRIX = ). Note that only the size of the square matrix is necessary for indexS(). Thus, if desired, the user can input an empty square matrix of the desired size for (MATRIX = ).

The following arguments are optional:

R: By default, this function will return index values from an LDSC object from either covariance matrix (S or S_Stand), in which the diagonal elements are free parameters. However, if the matrix being analyzed is a correlation matrix (R), in which the diagonal elements are fixed parameters (i.e. 1’s), and therefore do not have associated estimation errors, the correct index values can be obtained by specifying (R = T).

Examples for indexS()

The user may obtain the index values for the correlations between a set of traits and one external trait by specifying either the row or column of the external trait of interest, as the matrix of index values is symmetrical around the diagonal. Below is an example pulling the index values of 12 correlations between the first 12 traits and the 13th trait:

indexS(MATRIX = LDSCoutput$S)[1:12,13]
##  [1]  13  29  44  58  71  83  94 104 113 121 128 134

Here is an example of acquiring the same index values from the for the R matrix

indexS(MATRIX = LDSCoutput$S, R = T)[1:12,13]
##  [1]  12  27  41  54  66  77  87  96 104 111 117 122

If the user would like to model variances (e.g. heritabilities) across a subset of traits in S, the diagonal of the matrix would be specified, with the subset of traits as follows:

diag(indexS(MATRIX = LDSCoutput$S)[1:12,1:12])
##  [1]   1  18  34  49  63  76  88  99 109 118 126 133

The user may want to obtain values for the elements of a matrix corresponding to cross-group within-trait associations. This may be achieved by selecting the index values for the diagonal elements of the cross-group submatrix. In this hypothetical scenario, the set of traits in group one are represented by the first five variables, and the same traits in group two are represented in the same order by the next five variables.

diag(indexS(MATRIX = LDSCoutput$S)[1:5,6:10])
## [1]  6 23 39 54 68

subSV()

subSV() is a function which subsets an S and V matrices to the index values specified by indexS(), and takes the below arguments:

LDSC _OBJECT or SMATRIX and VMATRIX : This functions requires either an LDSC object from which the S and V matrix will be extracted, or the S and V matrix directly provided by the user. By default, the unstandardized S and V matrices will be extracted from the LDSC object if one is provided. If available in the LDSC object, standardized matrices may be extracted by setting (TYPE = “S_Stand”) and genetic correlations (which can be estimated and added to the LDSC object using rgmodel()) may be extracted by setting (TYPE = “R”).

INDEXVALS: This argument is the output of indexS(), and is a vector of index values which map the parameters of interest, from the S matrix, onto their cross-parameter estimation errors as found in the V matrix, which is more structurally complex. Note that if the user would like to subset an R matrix of genetic correlations, the index values should be obtained from indexS() with (TYPE = “R”) selected, and either an LDSC object which has been submitted to the rgmodel() function, or, alternatively, the R and V_R matrices can be directly submitted to the SMATRIX = and VMATRIX = arguments respectively.

TYPE: This argument defines the type of correlations extracted from the LDSC object if it is submitted to the function instead of separate S and V matrices. By default, subSV() is set to “S”, which provides unstandardized ldsc estimates. Two other options exist: “S_Stand” which will select the standardized correlations from the LDSC object if stand = TRUE was specified when estimating ldsc(); or “R” which will select the genetic correlations from the LDSC object if it has been run through the rgmodel() function.

Example of using subSV()

Here, we illustrate the use of subSV() using the LDSC object from Ennis et al., which includes some genetic correlations estimated or modeled in the paper. This initial exammple subsets the provided LDSC object to conduct the primary GLS regression - using the relations between the set of 12 neuroticism cut points and MDD as the dependent variable. Within the provided LDSC object, the neuroticism cutpoints are the first 12 traits, and MDD is the 13th trait.

Using indexS(), we select the index values of the V matrix corresponding to the elements of interest in the S matrix. These index values may be obtained with indexS(MATRIX = LDSCoutput$S)[1:12,13], and when put into subSV() will return the subset S and V matrices as seen below.

subSV(LDSC_OBJECT = LDSCoutput,
      INDEXVALS = indexS(MATRIX = LDSCoutput$S, R = F)[1:12,13], TYPE = "S_Stand")

To extract heritabilities we can specify the diagonal of the block of traits that we wish to model, and set TYPE = “S” as shown below.

subSV(LDSC_OBJECT = LDSCoutput,
      INDEXVALS = diag(indexS(MATRIX = LDSCoutput$S)[1:12,1:12]), TYPE = "S")

summaryGLS()

This function implements generalized least squares (GLS) regression, which is used when an estimate of the sampling covariance matrix of the error structure is available. By default, an intercept is included. This may be toggled off with (INTERCEPT = F).

The summaryGLS() function requires: a vector of observations for the dependent variable (e.g. a subset S matrix) and its associated sampling covariance matrix (V), either obtained from a subsetting an LDSC object, or from an external source. This may be set either by submitting an object with a V and an S matrix with the (OBJECT = ) argument, or by submitting a separate vector of observations for the dependent variable along with its associated sampling covariance matrix through the (Y = ), and (V_Y = ) arguments respectively; and a vector or matrix of predictor values (PREDICTORS = ), with number of rows equal to the number of observations for the dependent variable and the number of columns equal to the number of predictor variables. If the user wants to add vectors of higher order polynomial or interaction terms into GLS regression as predictors, these predictors must be created manually and input as separate variables in the matrix of predictors.

The output of summaryGLS() is a matrix of betas, pvalues, standard errors and z statistics for each beta estimated, with the number after each beta corresponding the the column number for the predictors.

For more details on GLS regression, please see:

Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48. https://doi.org/10.1017/S0370164600014346

Savalei, V. (2014). Understanding Robust Corrections in Structural Equation Modeling. Structural Equation Modeling: A Multidisciplinary Journal, 21(1), 149–160. https://doi.org/10.1080/10705511.2013.824793

https://en.wikipedia.org/wiki/Generalized_least_squares

Example of summaryGLS()

Here we show the output of summaryGLS() which carries out the primary GLS analysis from Ennis et al., to estimate how the genetic correlation between neuroticism and MDD varies across the severity spectrum of negative emotionality. We supply summaryGLS() with the subset S_Stand and V_Stand matrices as obtained with the subSV() function.

#These are the cutpoint prevalences for each neuroticism cutpoint in the primary
#GLS analysis testing trends in the genetic correlations between neuroticism
#and MDD across the severity spectrum of negative emotionality

PREVS <- c(0.85556773,0.75867410,0.64767354,
           0.53828939,0.43558773,0.34194621,
           0.25818629,0.19038580,0.13159727, 
           0.08571119,0.04947786,0.02166147)

#Here we are mapping the cutpoint prevalences to the normal distribution
PREDICTORS <- -1*qnorm(PREVS)

summaryGLS(
          #here subSV() is used, with indexS() inside of it
          OBJECT = subSV(LDSC_OBJECT = LDSCoutput,
                        INDEXVALS = indexS(MATRIX = LDSCoutput$S_Stand, R = F)[1:12,13], TYPE = "S_Stand"), 
                        PREDICTORS = PREDICTORS, INTERCEPT = T)
##         betas        pvals         SE        Z
## b0 0.52648636 1.055411e-18 0.05963095 8.829079
## b1 0.09049874 1.584398e-03 0.02865005 3.158764

summaryGLSbands()

summaryGLSbands() is a version of summaryGLS() that produces scatter plots for the GLS regression that include error bands. A scatter plot is generated with the outcome values represented on the Y axis, and the provided vector of predictor values on the X axis; the labels of these axes may be defined with (XLAB = ) and (YLAB = ), which are by default empty strings. Additionally, the X- and Y-axes may be modified through the (XCOORDS = ) and (YCOORDS = ) arguments, which are by default set to XCOORDS = c(-2,2), and YCOORDS = c(0,1).

Additionally, a set of control variables may be added with the argument (CONTROLVARS = ).

Error bands for this plot are provided by default and may be toggled off by setting (BANDS = F). Additionally, error bands are automatically set to +/- 1 SE - these may be changed to a different multiple of SEs (e.g. 1.96 for 95% confidence interval bands) with with the argument (BAND_SIZE = ).

As with summaryGLS(), this function requires for input a vector of predictors, set with the (PREDICTORS = ) argument; as well as a vector of dependent variables and their associated sampling covariance matrix, which may be submitted together as one object, as with a subset LDSC object using the subSV() function, with the (OBJECT = ) argument, or seperately with the (Y = ), and (V_Y = ) arguments which are the dependent variables and their sampling covariance matrices respectively.

Additionally, the horizontal resolution with which the error bands are estimated may be adjusted with the argument (INTERVALS = ); with the number of intervals required to be equal to or greater than the number of variables along the span of which you are estimating.

PREDICTORS <- -1*qnorm(c(0.85556773,0.75867410,0.64767354,
           0.53828939,0.43558773,0.34194621,
           0.25818629,0.19038580,0.13159727, 
           0.08571119,0.04947786,0.02166147))

summaryGLSbands(OBJECT = subSV(LDSC_OBJECT = LDSCoutput,
                INDEXVALS = indexS(MATRIX = LDSCoutput$S, R = F)[1:12,13], TYPE = "S_Stand"),
                PREDICTORS = PREDICTORS,
                INTERVALS = 40
                ,
                XLAB = "Neuroticism Severity (Z-Distribution)",
                YLAB = "Genetic Correlation w/ MDD",
                QUAD = T,
                XCOORDS = c(-1,2),
                YCOORDS = c(0,1))
##         betas        pvals         SE        Z
## b0 0.50936285 4.495604e-17 0.06064427 8.399192
## b1 0.08047254 6.145083e-03 0.02937021 2.739938
## b2 0.03375832 1.208844e-01 0.02176451 1.551072

Here, we add in 2 control variables (randomly generated for illustration). The betas, SEs etc… for the control variables are listed after those for the intercept, linear and (if specified, as is done here) quadratic coefficients.

PREDICTORS <- -1*qnorm(c(0.85556773,0.75867410,0.64767354,
           0.53828939,0.43558773,0.34194621,
           0.25818629,0.19038580,0.13159727,
           0.08571119,0.04947786,0.02166147))
 
CONTROLS = cbind((runif(12,0,1)),runif(12,0,1))
 
summaryGLSbands(OBJECT = subSV(LDSC_OBJECT = LDSCoutput,
                INDEXVALS = indexS(MATRIX = LDSCoutput$S, R = F)[1:12,13], TYPE = "S_Stand"),
                PREDICTORS = PREDICTORS,
                CONTROLVARS = CONTROLS,
                INTERVALS = 12,
                XLAB = "Neuroticism Severity (Z-Distribution)",
                YLAB = "Genetic Correlation w/ MDD",
                QUAD = T,
                XCOORDS = c(-1,2),
                YCOORDS = c(0,1))
##         betas        pvals         SE         Z
## b0 0.48792971 2.034484e-15 0.06145810 7.9392250
## b1 0.07757701 9.233269e-03 0.02979956 2.6032942
## b2 0.04549162 4.231863e-02 0.02240555 2.0303730
## b3 0.01249622 5.661991e-01 0.02178338 0.5736582
## b4 0.05270801 3.429267e-02 0.02490205 2.1166135