The paLDSC function identifies the number of non-spurious dimensions in exploratory genomic factor analysis. The method adapts a classic procedure known as Parallel Analysis (Horn, 1965) to genomic data. paLDSC compares the eigenvalues from the eigendecomposition of the LDSC genetic correlation matrix with the eigenvalues of Monte-Carlo simulated null correlation matrices whose random noise is drawn from the multivariate LDSC sampling distribution. The suggested number of factors to extract is the number of components whose observed eigenvalue exceeds a pre-specified percentile of the corresponding eigenvalue distribution generated under the null.
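The comparison described above can be illustrated with a small base-R sketch. This is not the paLDSC source code: the toy matrices `S_toy` and `V_toy`, the use of an identity matrix as the null, and the rule for counting retained components are all assumptions made for illustration.

```r
# Toy sketch of parallel analysis with sampling noise (NOT the paLDSC implementation).
set.seed(1)
k <- 4                                    # number of traits (toy value)

# Toy "observed" genetic correlation matrix with one strong shared dimension
S_toy <- matrix(0.6, k, k)
diag(S_toy) <- 1

# Toy sampling covariance matrix of the k(k+1)/2 non-redundant elements (assumed shape)
n_el <- k * (k + 1) / 2
V_toy <- diag(0.01, n_el)

# Eigenvalues of the observed matrix
obs_eig <- eigen(S_toy, symmetric = TRUE, only.values = TRUE)$values

r <- 200                                  # number of Monte-Carlo replications
null_eig <- matrix(NA_real_, nrow = r, ncol = k)
U <- chol(V_toy)                          # t(U) %*% U equals V_toy

for (i in seq_len(r)) {
  noise <- as.vector(rnorm(n_el) %*% U)   # one draw from MVN(0, V_toy)
  N <- diag(k)                            # assumed null: no genetic correlation
  idx <- lower.tri(N, diag = TRUE)
  N[idx] <- N[idx] + noise                # add sampling noise to the lower triangle
  N[upper.tri(N)] <- t(N)[upper.tri(N)]   # mirror to keep the matrix symmetric
  null_eig[i, ] <- eigen(N, symmetric = TRUE, only.values = TRUE)$values
}

# Percentile threshold per component, then count leading components that exceed it
threshold <- apply(null_eig, 2, quantile, probs = 0.95)
exceeds <- obs_eig > threshold
n_keep <- if (all(exceeds)) k else which(!exceeds)[1] - 1
n_keep
```

Stopping at the first component that fails the threshold is one common convention in parallel analysis; whether paLDSC counts components in exactly this way is not stated here.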

The paLDSC function requires two mandatory arguments: S and V. Additionally, several optional arguments can be used. Let’s go through each of them and understand their purpose:

Mandatory Arguments

S: Either the genetic correlation matrix (S_Stand) or the genetic covariance matrix (S) from the LDSC output.

IMPORTANT: We strongly recommend using the LDSC genetic correlation matrix S_Stand as input. Analyzing S_Stand maximizes the proportion of genetic variance explained. Analyzing the LDSC genetic covariance matrix S maximizes the proportion of phenotypic variance explained, because the genetic covariance is scaled with respect to standardized phenotypic variance. The two analyses diverge more as SNP heritability (h2) becomes more heterogeneous across phenotypes.

V: The corresponding standardized (V_Stand) or unstandardized (V) multivariate sampling distribution matrix from the LDSC output.

IMPORTANT: The S and V arguments must match. If S is the LDSC genetic correlation matrix (S_Stand), V must be the multivariate sampling distribution matrix of the genetic correlation matrix, i.e. V_Stand. If S is the LDSC genetic covariance matrix (S), V must be the multivariate sampling distribution matrix of the genetic covariance matrix, i.e. V.
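A quick sanity check before calling paLDSC can catch a mismatched pair. The helper below, `check_pa_inputs`, is hypothetical and not part of any package. It assumes that the sampling distribution matrix has one row and column per non-redundant element of the k-trait matrix, i.e. k(k+1)/2, and it uses a unit diagonal as a heuristic for recognizing a correlation matrix.

```r
# Hypothetical helper (not part of any package): basic checks on an S / V pair.
check_pa_inputs <- function(S, V) {
  if (nrow(S) != ncol(S)) stop("S must be a square matrix")
  if (nrow(V) != ncol(V)) stop("V must be a square matrix")

  k <- nrow(S)
  n_el <- k * (k + 1) / 2                 # assumed dimension of V for k traits
  if (nrow(V) != n_el) {
    stop("V has ", nrow(V), " rows but ", n_el, " are expected for ", k, " traits")
  }

  # Heuristic: a correlation matrix has ones on the diagonal
  is_corr <- isTRUE(all.equal(unname(diag(S)), rep(1, k)))
  if (is_corr) {
    "S looks like a correlation matrix: pair it with V_Stand"
  } else {
    "S looks like a covariance matrix: pair it with V"
  }
}

# Toy usage with made-up matrices
S_corr <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
S_cov  <- matrix(c(0.3, 0.1, 0.1, 0.2), nrow = 2)
V_any  <- diag(0.01, 3)
check_pa_inputs(S_corr, V_any)
check_pa_inputs(S_cov, V_any)
```

The check cannot tell V from V_Stand, since both have the same dimensions; it only reports which one the supplied S calls for.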

Optional Arguments

The number of replications for the Monte-Carlo simulations of null correlation matrices. The default value is 500 replications.

The percentile for the simulated eigenvalues. This determines the threshold for identifying the number of non-spurious dimensions. The default value is 0.95.

A logical value indicating whether diagonalized parallel analysis (PA) should be conducted. If TRUE, the function assumes uncorrelated sampling errors when simulating the null correlation matrices. If FALSE, the full LDSC multivariate sampling distribution matrix V is used to simulate the null. The default value is FALSE.

A logical value indicating whether the eigenvalues should also be computed from the common factor solution of an exploratory factor analysis, using the method described for the fa.parallel function from the psych package. If TRUE, the function will perform factor analysis and compute eigenvalues. The default value is FALSE.

The factor method for the exploratory factor analysis if fa = TRUE. It determines the method used to extract factors. The default value is “minimum residual” (see the documentation for the fa.parallel function from the psych package).

The number of factors to be extracted in the exploratory factor analysis if fa = TRUE. The default value is 1.

A logical value indicating whether the scree plots derived from the PA function should be saved into a PDF file. The default value is FALSE.

If TRUE, only the eigenvalues are computed and returned, otherwise both eigenvalues and eigenvectors are returned (FALSE by default).
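To make the diagonalized option described above more concrete, the sketch below contrasts noise drawn from a full sampling covariance matrix with noise drawn from its diagonal only. The matrix `V_toy` and the helper `draw_noise` are made up for illustration, and treating "uncorrelated sampling errors" as zeroing the off-diagonal elements of V is an assumption about what the option does, not a description of the paLDSC internals.

```r
# Toy sampling covariance matrix with correlated errors (illustrative values only)
V_toy <- matrix(c(0.010, 0.004, 0.002,
                  0.004, 0.020, 0.003,
                  0.002, 0.003, 0.015), nrow = 3, byrow = TRUE)

# Diagonalized version: keep the sampling variances, set the covariances to zero
V_diag <- diag(diag(V_toy))

# Hypothetical helper: n draws from a multivariate normal with mean 0 and covariance V
draw_noise <- function(V, n) {
  matrix(rnorm(n * ncol(V)), nrow = n) %*% chol(V)
}

set.seed(42)
noise_full <- draw_noise(V_toy, 5000)
noise_diag <- draw_noise(V_diag, 5000)

# Correlation between the first two noise terms under each setting
cor(noise_full)[1, 2]   # clearly non-zero: sampling errors are correlated
cor(noise_diag)[1, 2]   # close to zero: sampling errors are independent
```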

Example

Now let’s run the paLDSC function with default settings and 10 replications (see optional arguments above), using the LDSC output provided here:

#1. Load multivariate LDSC output
load("LDSC_output_paLDSC_example.RData")
#2. Genetic correlation matrix from LDSC output
S_Stand <- LDSCoutputfull$S_Stand
#3. Associated standardized multivariate sampling distribution matrix from LDSC output
V_Stand <- LDSCoutputfull$V_Stand
#4. Run the paLDSC function with 10 replications
paLDSC(S = S_Stand, V = V_Stand, r = 10)
## Running parallel analysis. Replication number:  1 
## Running parallel analysis. Replication number:  2 
## Running parallel analysis. Replication number:  3 
## Running parallel analysis. Replication number:  4 
## Running parallel analysis. Replication number:  5 
## Running parallel analysis. Replication number:  6 
## Running parallel analysis. Replication number:  7 
## Running parallel analysis. Replication number:  8 
## Running parallel analysis. Replication number:  9 
## Running parallel analysis. Replication number:  10

## ------------------------------------------------------------------------ 
## Parallel Analysis suggests extracting 1 components 
## ------------------------------------------------------------------------

The paLDSC output shows that the first component from the LDSC-derived genetic correlation matrix has a larger eigenvalue than 95% of the eigenvalues observed for the corresponding component from the null correlation matrix.

This result suggests that there is one dominant factor or dimension in the data. Extracting this component can help capture the underlying structure or pattern in the genetic data. This can be particularly useful for further analyses, such as factor analysis, visualization, or downstream statistical modeling.

Keep in mind that the decision to extract 1 component is based on the specified percentile threshold (default: 95%) and the comparison of eigenvalues between the LDSC-derived genetic correlation matrix and the null correlation matrix generated through Monte Carlo simulations. It is important to interpret the results in the context of your specific analysis and consider other factors such as the nature of the data and the goals of your research.