1 Analysis Summary

This analysis applies Genomic Structural Equation Modeling (GenomicSEM) to multivariate summary statistics for three chemokines (CXCL3, CCL4, and CCL27) that linkage disequilibrium score regression (LDSC) identified as moderately genetically correlated, with pairwise genetic correlations of 0.28 to 0.44. Modeling the traits jointly captures their shared genetic architecture and reduces noise relative to traditional univariate approaches.

Key findings include two genome-wide significant loci, on chromosomes 2 and 7. The associated variants include rs1260326, a missense variant in GCKR, and rs7779873, an intronic variant in CD36; both genes have potential implications for metabolic pathways. Functional annotation with CADD (PHRED-scaled scores) and FUMA supports the biological relevance of these loci.

The value of GenomicSEM lies in its ability to provide a robust framework for testing the multivariate genetic architecture of traits, enhancing statistical power and uncovering pleiotropic effects that may remain undetected in pairwise or univariate analyses. By integrating functional annotations and mapping tools, this approach strengthens the interpretation of genetic findings and their translational potential in complex traits, such as inflammatory and metabolic disorders.

1.1 Bioinformatics Processing: Fetching rsIDs and P-values

1.2 Munge by Chromosome

1.3 Move to munged_40

1.4 Merge the Munged Files for Each Protein

nohup Rscript merged_munged_40.R > merged_munged_40.log 2>&1 &

1.5 Add P Back for Sumstats

nohup Rscript add_p_sumstats.R > add_p_sumstats.log 2>&1 &

1.6 Sumstats of Munged Data

nohup Rscript sumstats_on_sumstats.R > sumstats_on_sumstats.log 2>&1 &

1.7 MV LDSC on 40 Chemokines

nohup Rscript ldsc40.R > ldsc40.log 2>&1 &

1.8 Select Three Moderately Genetically Correlated Chemokines to Format with sumstats: CXCL3, CCL4, and CCL27

| Pair | Chemokine 1 | Chemokine 2 | Genetic Correlation | Standard Error |
|---|---|---|---|---|
| CCL4_CXCL3 | CCL4 | CXCL3 | 0.3367 | 0.1022 |
| CCL27_CXCL3 | CCL27 | CXCL3 | 0.4379 | 0.1042 |
| CCL27_CCL4 | CCL27 | CCL4 | 0.2758 | 0.1001 |
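As a quick check on what "moderately genetically correlated" means here, the following base R sketch derives z-scores, two-sided P-values, and 95% confidence intervals from the three reported estimates and standard errors. It assumes the estimates are approximately normally distributed; the object and column names (`rg`, `ci_lower`, `ci_upper`) are illustrative and do not come from the analysis scripts.

```r
# Genetic correlations and standard errors as reported in the table above
rg <- data.frame(
  pair = c("CCL4_CXCL3", "CCL27_CXCL3", "CCL27_CCL4"),
  rg   = c(0.3367, 0.4379, 0.2758),
  se   = c(0.1022, 0.1042, 0.1001)
)

# Wald z-score and two-sided P-value (assumes approximate normality)
rg$z <- rg$rg / rg$se
rg$p <- 2 * pnorm(-abs(rg$z))

# 95% confidence interval under the same normal approximation
rg$ci_lower <- rg$rg - qnorm(0.975) * rg$se
rg$ci_upper <- rg$rg + qnorm(0.975) * rg$se

print(rg, digits = 3)
```

Under this approximation all three intervals exclude zero, which is consistent with treating the three chemokines as sharing genetic signal.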

1.9 LDSC for Three Moderately Genetically Correlated Chemokines

1.10 Parallel Analysis Based on Multivariate LDSC

1.10.1 Code for Parallel Analysis Based on Multivariate LDSC

1.10.2 Parallel Analysis Based on Multivariate LDSC: Decision Figure

# Path to the saved parallel-analysis plot
plot_path <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_chemokine_list/munged_40/merged_munged_protein_results/pvals_fix/ldsc_results_rgs/paLDSC_plot.png"

# Include the plot in the rendered report (include_graphics() comes from knitr)
knitr::include_graphics(plot_path)

1.11 Exploratory Factor Analysis

1.12 One-Factor Multivariate GWAS (userGWAS)

1.12.1 Background Definitions

Structural Equation Modeling (SEM) is a statistical technique used to model and analyze complex relationships between observed (measured) and unobserved (latent) variables. It combines elements of multiple regression, factor analysis, and path analysis into a single, flexible framework.

Genomic Structural Equation Modeling (GenomicSEM) is a statistical framework that applies SEM to GWAS summary statistics. It adapts SEM principles to work with genetic covariance matrices, enabling the study of complex relationships among traits, shared genetic architectures, and hypothesized causal pathways. It uses two matrices:

  1. S Matrix (Genetic Covariance Matrix):
    • The S matrix holds the genetic variances and covariances of the traits. It is estimated by LDSC from GWAS summary statistics and is unstandardized: the diagonal contains each trait's SNP heritability, and the off-diagonal elements contain the pairwise genetic covariances.
    • For the three traits analyzed here (CXCL3, CCL4, and CCL27), S is a 3 × 3 symmetric matrix with three heritabilities and three pairwise genetic covariances.
  2. V Matrix (Sampling Covariance Matrix):
    • The V matrix is the sampling covariance matrix of the estimates in S. Its diagonal holds the sampling variances (squared standard errors) of those estimates, and its off-diagonal elements hold their sampling covariances, which capture the dependence among estimates.
    • With three traits, S has six unique elements, so V is 6 × 6.
    • The V matrix is crucial for properly accounting for standard errors and dependencies in the modeling process.

  • Modeling:
    • The S matrix is used to model the relationships between traits.
    • The V matrix ensures accurate statistical inference by incorporating the uncertainty in the genetic covariance estimates.
  • General SEM Framework:
    • lavaan is an R package for specifying and estimating structural equation models. GenomicSEM models are specified in lavaan syntax.
  • Parameter Estimation:
    • lavaan estimates model parameters from a covariance matrix rather than from individual-level data, which is why an LDSC-derived S matrix can take the place of an observed covariance matrix.
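To make the matrix dimensions concrete, the base R sketch below builds an illustrative 3 × 3 matrix from the genetic correlations reported in Section 1.8 and counts the unique elements that determine the size of V. This is a stand-in on the correlation scale, not the S matrix from the actual LDSC run (the SNP heritabilities are not listed in this report). The names `rg_mat` and `one_factor_model` are hypothetical, and the model string is only a simplified example of lavaan-style syntax for a one-factor model with a SNP effect; `usergwas_3chemokines.R` may specify the model differently.

```r
traits <- c("CXCL3", "CCL4", "CCL27")
k <- length(traits)

# Illustrative stand-in for S on the correlation scale (1s on the diagonal).
# The real S matrix is unstandardized, with SNP heritabilities on the diagonal.
rg_mat <- diag(1, k)
dimnames(rg_mat) <- list(traits, traits)
rg_mat["CCL4", "CXCL3"]  <- rg_mat["CXCL3", "CCL4"]  <- 0.3367
rg_mat["CCL27", "CXCL3"] <- rg_mat["CXCL3", "CCL27"] <- 0.4379
rg_mat["CCL27", "CCL4"]  <- rg_mat["CCL4", "CCL27"]  <- 0.2758

# A symmetric k x k matrix has k(k + 1)/2 unique elements;
# V has one row and one column per unique element of S.
n_unique <- k * (k + 1) / 2
v_dim <- c(n_unique, n_unique)

# Hypothetical one-factor model in lavaan-style syntax:
# a latent factor F1 loads on the three chemokines and is regressed on the SNP.
one_factor_model <- "F1 =~ CXCL3 + CCL4 + CCL27
F1 ~ SNP"

print(rg_mat)
cat("Unique elements in S:", n_unique, "\n")
cat("Dimensions of V:", v_dim[1], "x", v_dim[2], "\n")
```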

nohup Rscript usergwas_3chemokines.R > usergwas_3chemokines.log 2>&1 &

1.13 Value

nohup Rscript /Users/charleenadams/ukbppp/comp_manhattan.R > comp_manhattan.log 2>&1 &

## CXCL3 Manhattan Plot

## CCL4 Manhattan Plot

## CCL27 Manhattan Plot

## Latent Factor Manhattan Plot (no filtering)

## Stringent QC: Labeled Latent Factor Manhattan Plot

1.13.1 Comparison with the Original GWAS P-values

1.13.1.1 Code for P-values Comparison

1.13.2 P-value Figures

1.13.2.1 Code for P-value Figures

1.14 Nearest Genes from Ensembl: CD36 and GCKR

1.14.1 Code for Nearest Genes from Ensembl

1.14.2 CD36: Cluster of Differentiation 36

  • Location: Chromosome 7q21.11
  • Function: Encodes a transmembrane glycoprotein involved in:
    • Fatty Acid Uptake: Critical for lipid metabolism.
    • Lipid Metabolism: Regulates lipid storage and utilization.
    • Immune Response: Contributes to inflammation and pathogen recognition.
    • Angiogenesis: Supports blood vessel formation.
  • Clinical Relevance:
    • Implicated in insulin resistance, atherosclerosis, and cancer progression.
    • A key target for therapies addressing metabolic syndromes and cardiovascular diseases.

1.14.3 GCKR: Glucokinase Regulatory Protein

  • Location: Chromosome 2p23.3
  • Function: Encodes a protein that modulates glucokinase activity, playing a pivotal role in:
    • Glucose Metabolism: Regulates glucokinase activity, mainly in the liver, and thereby influences glucose uptake and storage.
  • Clinical Relevance:
    • Variants influence fasting glucose levels and triglyceride concentrations.
    • Associated with the risk of type 2 diabetes and non-alcoholic fatty liver disease (NAFLD).

1.14.4 Integrated Role in Metabolic Disorders

Together, Cluster of Differentiation 36 (CD36) and Glucokinase Regulatory Protein (GCKR) are critical to understanding the genetic architecture of metabolic disorders. These genes provide valuable insights for developing therapeutic strategies targeting:
- Diabetes
- Dyslipidemia
- Cardiovascular Diseases
- NAFLD

1.14.5 Code for Fetching Nearest Genes Bioinformatically

1.15 CADD (Combined Annotation Dependent Depletion)

| SNP | CHR | BP | MAF | A1 | A2 | est | SE | Pval_Estimate | nearest_gene | Consequence | ConsDetail | GeneName | PHRED | PhastCons | PhyloP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs1728918 | 2 | 27635463 | 0.272366 | A | G | 0.04235485 | 0.006950637 | 1.103550e-09 | PPM1G | UPSTREAM | upstream | PPM1G | 2.458 | 0.060 | 0.283 |
| rs1260326 | 2 | 27730940 | 0.410537 | T | C | 0.04512821 | 0.006290023 | 7.253483e-13 | GCKR | NON_SYNONYMOUS | splice,missense | GCKR | 13.220 | 0.995 | 0.943 |
| rs780094 | 2 | 27741237 | 0.410537 | T | C | 0.04275407 | 0.006290005 | 1.067132e-11 | GCKR | INTRONIC | intron | GCKR | 1.852 | 0.000 | -0.872 |
| rs780093 | 2 | 27742603 | 0.411531 | T | C | 0.04217555 | 0.006287704 | 1.978192e-11 | GCKR | INTRONIC | intron | GCKR | 1.541 | 0.000 | -0.109 |
| rs7779873 | 7 | 80211423 | 0.453280 | G | A | -0.03564340 | 0.006215646 | 9.782121e-09 | CD36 | INTRONIC | intron | CD36 | 5.362 | 0.002 | 0.173 |
| rs6961069 | 7 | 80218961 | 0.419483 | C | T | -0.03646031 | 0.006270293 | 6.071746e-09 | CD36 | INTRONIC | intron | CD36 | 5.150 | 0.000 | 0.382 |
| rs13236689 | 7 | 80236014 | 0.424453 | T | G | -0.03506074 | 0.006260322 | 2.137726e-08 | CD36 | INTRONIC | intron | CD36 | 8.262 | 0.005 | -0.008 |
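The est, SE, and Pval_Estimate columns can be related to one another with a short base R sketch. It assumes, as the report does not state this explicitly, that Pval_Estimate is a two-sided Wald P-value computed from est / SE under a normal approximation, and it uses the conventional genome-wide significance threshold of 5e-8. The data frame name `gwas` and the derived column names are illustrative.

```r
# Effect estimates, standard errors, and reported P-values from the table above
gwas <- data.frame(
  SNP = c("rs1728918", "rs1260326", "rs780094", "rs780093",
          "rs7779873", "rs6961069", "rs13236689"),
  est = c(0.04235485, 0.04512821, 0.04275407, 0.04217555,
          -0.03564340, -0.03646031, -0.03506074),
  SE  = c(0.006950637, 0.006290023, 0.006290005, 0.006287704,
          0.006215646, 0.006270293, 0.006260322),
  P_reported = c(1.103550e-09, 7.253483e-13, 1.067132e-11, 1.978192e-11,
                 9.782121e-09, 6.071746e-09, 2.137726e-08)
)

# Wald z-score and two-sided P-value (assumed test; see lead-in)
gwas$Z <- gwas$est / gwas$SE
gwas$P_wald <- 2 * pnorm(-abs(gwas$Z))

# Flag variants that pass the conventional genome-wide threshold
gwas$genome_wide_sig <- gwas$P_reported < 5e-8

print(gwas[, c("SNP", "Z", "P_wald", "P_reported", "genome_wide_sig")], digits = 4)
```

The sign of Z also shows the direction of effect for A1: positive at the chromosome 2 locus and negative at the chromosome 7 locus.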

1.15.1 Interpretations

  • CADD-PHRED Score:
    • PHRED scores indicate the relative deleteriousness of variants.
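    • Scores of 10 or above: the variant is predicted to be among the 10% most deleterious possible substitutions in the human genome.
    • Scores of 20 or above: the variant is predicted to be among the 1% most deleterious possible substitutions.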
  • PhastCons:
    • Evolutionary conservation score; higher values indicate stronger conservation across species.
    • Range: \(0\) to \(1\).
    • 0: No conservation.
    • 1: Maximum conservation, indicating a highly conserved region across species.
  • PhyloP:
    • Measures evolutionary conservation or acceleration at individual sites.
    • Positive scores indicate conservation, while negative scores indicate acceleration.
    • Range: unbounded in principle (\(-\infty\) to \(+\infty\)); most scores fall within \(-3\) to \(+3\).
  • Conclusion: The analyzed variants differ in predicted functional importance. rs1260326 stands out: it is a non-synonymous (missense) variant in GCKR with the highest PHRED score in the table (13.22) and high conservation (PhastCons 0.995, PhyloP 0.943). The remaining variants are intronic or upstream, have PHRED scores below 10, and show little conservation (PhastCons of 0.06 or less). The CD36 variants may nonetheless play roles in regulatory mechanisms.
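The PHRED thresholds above follow from the scaling itself: a PHRED-scaled score of x corresponds to the top 10^(-x/10) fraction of ranked substitutions. The base R sketch below applies that conversion to the seven PHRED scores in the CADD table; the object names (`phred`, `top_fraction`, `cadd`) are illustrative.

```r
# PHRED-scaled CADD scores from the table in Section 1.15
phred <- c(rs1728918 = 2.458, rs1260326 = 13.220, rs780094 = 1.852,
           rs780093 = 1.541, rs7779873 = 5.362, rs6961069 = 5.150,
           rs13236689 = 8.262)

# Invert the PHRED scaling: score = -10 * log10(top fraction)
top_fraction <- 10^(-phred / 10)

cadd <- data.frame(
  SNP         = names(phred),
  PHRED       = phred,
  top_percent = round(100 * top_fraction, 1),
  above_10    = phred >= 10,
  row.names   = NULL
)

# Sort from most to least deleterious by PHRED score
print(cadd[order(-cadd$PHRED), ])
```

On this scale, only rs1260326 crosses the PHRED 10 threshold, placing it within roughly the top 5% of substitutions.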

1.15.2 Protein Phosphatase, Mg²⁺/Mn²⁺ Dependent 1G (PPM1G)

PPM1G encodes a serine/threonine phosphatase that is part of the protein phosphatase 2C (PP2C) family. It plays a critical role in:
- Dephosphorylation: Removes phosphate groups from serine and threonine residues on proteins.
- Cell Cycle Regulation: Supports proper cell cycle progression.
- Pre-mRNA Splicing: Ensures accurate and efficient gene expression by regulating pre-mRNA splicing.
- Cell Stress Response: Dephosphorylates proteins involved in stress adaptation.
- Cancer Biology: Altered expression has been linked to tumorigenesis.
- Neurological Function: Plays a role in neuronal signaling and brain activity.

1.15.3 Code for CADD/PHRED

1.16 Functional Mapping and Annotation (FUMA) of Genome-Wide Association Studies

1.16.1 GTEx Heatmap

## GTEx Heatmap

1.16.2 “Independent” SNPs (r2 < 0.6) at the Chr 2 and Chr 7 Loci

FUMA’s default definition of independence is PLINK clumping at r2 < 0.6. This threshold is user-defined. A stricter threshold could have been specified, so the SNPs below are “independent” only in the sense of r2 < 0.6.

| Genomic Locus | uniqID | rsID | chr | pos |
|---|---|---|---|---|
| 1 | 2:27635463:A:G | rs1728918 | 2 | 27635463 |
| 1 | 2:27730940:C:T | rs1260326 | 2 | 27730940 |
| 2 | 7:80211423:A:G | rs7779873 | 7 | 80211423 |
| 2 | 7:80218961:C:T | rs6961069 | 7 | 80218961 |