The paLDSC function identifies the number of non-spurious dimensions in exploratory genomic factor analysis. Our method adapts a classic method known as Parallel Analysis (Horn, 1965) to the genomic space. paLDSC compares the eigenvalues from the eigendecomposition of the LDSC genetic correlation matrix to the eigenvalues of Monte-Carlo simulated null correlation matrices, with random noise drawn from the multivariate LDSC sampling distribution. The suggested number of factors to extract is the number of components whose eigenvalue exceeds a pre-specified percentile of the corresponding distribution of eigenvalues generated under the null.
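The comparison described above can be illustrated with a minimal base R sketch. This is not the paLDSC source code: the toy matrices `S_toy` and `V_toy`, the use of an identity matrix as the null, and the rule of counting leading components above the threshold are all assumptions made for illustration.

```r
# Toy sketch of parallel analysis on a correlation matrix (base R only).
# Object names and values are illustrative assumptions, not paLDSC internals.
set.seed(1)

k <- 4                                    # number of traits
S_toy <- matrix(0.6, k, k)                # toy "observed" genetic correlation matrix
diag(S_toy) <- 1

# Toy sampling covariance matrix of the k*(k+1)/2 lower-triangular elements
# (assumed diagonal here, each element with a standard error of 0.05)
n_elem <- k * (k + 1) / 2
V_toy  <- diag(0.05^2, n_elem)

r <- 200                                  # Monte-Carlo replications
p <- 0.95                                 # percentile threshold

obs_eig <- eigen(S_toy, symmetric = TRUE, only.values = TRUE)$values

# Lower Cholesky factor, used to draw multivariate normal noise with covariance V_toy
L <- t(chol(V_toy))

null_eig <- matrix(NA_real_, nrow = r, ncol = k)
for (i in seq_len(r)) {
  noise <- as.vector(L %*% rnorm(n_elem)) # one draw of sampling error
  N <- matrix(0, k, k)
  N[lower.tri(N, diag = TRUE)] <- noise   # fill the lower triangle column-wise
  N <- N + t(N) - diag(diag(N))           # symmetrise without doubling the diagonal
  null_mat <- diag(k) + N                 # assumed null: identity plus sampling noise
  null_eig[i, ] <- eigen(null_mat, symmetric = TRUE, only.values = TRUE)$values
}

# p-th percentile of each ordered null eigenvalue across replications
threshold <- apply(null_eig, 2, quantile, probs = p)

# Count the leading components whose observed eigenvalue exceeds the threshold
exceeds     <- obs_eig > threshold
n_suggested <- sum(cumprod(exceeds))
n_suggested
```

In this toy setting the observed matrix has one large eigenvalue (2.8) and three small ones (0.4), so only the first component clears the null threshold.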
The paLDSC function requires two mandatory
arguments: S and
V. Additionally, several optional
arguments can be used. Let’s go through each of them and understand
their purpose:
S: Either the genetic correlation matrix
(S_Stand) or the genetic covariance
(S) from the LDSC output.
IMPORTANT: We strongly recommend using the LDSC
genetic correlation matrix S_Stand as input. Analysis of
the LDSC genetic correlation matrix S_Stand will maximize
the proportion of genetic variance explained. Analysis of the LDSC
genetic covariance matrix S maximizes the proportion of
phenotypic variance explained since the genetic covariance is scaled
with respect to standardized phenotypic variance. The two analyses
diverge more as heterogeneity in SNP h2 across phenotypes
increases.
V: The corresponding standardized
(V_Stand) or unstandardized multivariate
sampling distribution matrix (V) from the
LDSC output.
IMPORTANT: If the S argument is the LDSC genetic
correlation matrix S_Stand, the mandatory argument V must
be the multivariate sampling distribution matrix of the genetic
correlation matrix, i.e. V_Stand. If the S argument is the
LDSC genetic covariance matrix S, the mandatory argument V
must be the LDSC multivariate sampling distribution matrix
corresponding to the genetic covariance matrix, i.e. V.
r: The number of replications for the Monte-Carlo simulations of null
correlation matrices. The default value is 500
replications.
p: The percentile for the simulated eigenvalues. This determines the
threshold for identifying the number of non-spurious dimensions. The
default value is 0.95.
diag: A logical value indicating whether diagonalized parallel analysis
(PA) should be conducted. If TRUE, the function will assume uncorrelated
sampling errors for the simulated null correlation matrices. If
FALSE, the whole multivariate sampling distribution LDSC V
matrix will be used to simulate the null. The default value is
FALSE.
fa: A logical value indicating whether the eigenvalues should also be
computed from a common factor solution from an exploratory factor
analysis, using the method described for the fa.parallel
function from the psych package. If TRUE, the function will
perform factor analysis and compute eigenvalues. The default value is
FALSE.
fm: The factor method for the exploratory factor analysis if
fa = TRUE. It determines the method used to extract
factors. The default value is “minimum residual” (see documentation for
the fa.parallel function from the psych package).
nfactors: The number of factors to be extracted in the exploratory factor
analysis if fa = TRUE. The default value is 1.
save.pdf: A logical value indicating whether the scree plots derived from the
PA function should be saved into a PDF file. The default value is
FALSE.
only.values: If TRUE, only the eigenvalues are computed and returned,
otherwise both eigenvalues and eigenvectors are returned
(FALSE by default).
Now let’s run the paLDSC function with
default settings and 10 replications (see optional arguments above),
using the LDSC output provided here:
#1. Load multivariate LDSC output
load("LDSC_output_paLDSC_example.RData")
#2. Genetic correlation matrix from LDSC output
S_Stand <- LDSCoutputfull$S_Stand
#3. Associated standardized multivariate sampling distribution matrix from LDSC output
V_Stand <- LDSCoutputfull$V_Stand
#4. Run the paLDSC function
paLDSC(S=S_Stand,V=V_Stand,r=10)
## Running parallel analysis. Replication number: 1
## Running parallel analysis. Replication number: 2
## Running parallel analysis. Replication number: 3
## Running parallel analysis. Replication number: 4
## Running parallel analysis. Replication number: 5
## Running parallel analysis. Replication number: 6
## Running parallel analysis. Replication number: 7
## Running parallel analysis. Replication number: 8
## Running parallel analysis. Replication number: 9
## Running parallel analysis. Replication number: 10
## ------------------------------------------------------------------------
## Parallel Analysis suggests extracting 1 components
## ------------------------------------------------------------------------
The paLDSC output shows that the first component from the LDSC-derived genetic correlation matrix has a larger eigenvalue than 95% of the eigenvalues observed for the corresponding component from the null correlation matrix.
This result suggests that there is one dominant factor or dimension in the data. Extracting this component can help capture the underlying structure or pattern in the genetic data. This can be particularly useful for further analyses, such as factor analysis, visualization, or downstream statistical modeling.
Keep in mind that the decision to extract 1 component is based on the specified percentile threshold (default: 95%) and the comparison of eigenvalues between the LDSC-derived genetic correlation matrix and the null correlation matrix generated through Monte Carlo simulations. It is important to interpret the results in the context of your specific analysis and consider other factors such as the nature of the data and the goals of your research.