April 05, 2017

GWAS Study Design

Type 2 Diabetes

  • Type 2 diabetes is characterized by insulin resistance in peripheral tissues and dysregulated insulin secretion by pancreatic \(\beta\)-cells.
  • Pathogenesis is heterogeneous.

Literature review

Exploring our GWAS data

  • Our data set contains 6424 SNPs for 248 individuals: 159 controls and 89 cases of T2D.

Quality assurance in GWAS

  • We want to remove any artifacts which may increase Type I or Type II errors. Examples of these artifacts include (Anderson 2010):
    • Differences in population structure
    • Differences in DNA quality or handling procedures
    • Missing genotype rate
  • Quality assurance (QA) relates to the post-production review of quality (Laurie 2010). Two categories of QA:
    • Subject-level
    • SNP-level

Quality assurance summary

  • Using a range of QA measures at the subject and SNP level, we drop 8 subjects and 493 SNPs (the table below may double-count, because one subject or SNP can fail several checks).
Table 1: Number of potential exclusions

| Exclusion | Number | Type |
|---|---|---|
| Missing call rates | 1 | Subject |
| Population structure | 4 | Subject |
| Heterozygosity | 0 | Subject |
| Gender check | 2 | Subject |
| Relatedness | 0 | Subject |
| Case-control PC | 0 | Subject |
| Case-control missing call rate | 29 | SNP |
| Missing call rate over samples | 205 | SNP |
| Duplicate sample discordance | 4 | Subject |
| Hardy-Weinberg equilibrium | 11 | SNP |
| Minor allele frequency | 132 | SNP |
| Linkage disequilibrium | 138 | SNP |
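
Three of the SNP-level checks in the table (missing call rate, minor allele frequency, Hardy-Weinberg equilibrium) can be sketched for a single SNP as below. This is a minimal illustration, not the pipeline used for the slides: the function names `snp_qc` and `passes_qc` are hypothetical, the thresholds are placeholder assumptions, and the HWE check uses a simple 1-df chi-square test rather than an exact test.

```python
from math import erfc, sqrt

def snp_qc(genotypes):
    """Summarise one SNP.

    `genotypes` holds allele counts coded 0/1/2, with None for a missing call.
    Returns (call_rate, minor_allele_frequency, hwe_p_value).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 0.0, 0.0, 1.0  # nothing called: no frequency or HWE information
    call_rate = n / len(genotypes)

    n_aa, n_ab, n_bb = (called.count(k) for k in (0, 1, 2))
    q = (n_ab + 2 * n_bb) / (2 * n)  # frequency of the counted allele
    p = 1.0 - q
    maf = min(p, q)

    # Chi-square goodness-of-fit against Hardy-Weinberg proportions (1 df).
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = erfc(sqrt(chi2 / 2))  # upper tail of a chi-square with 1 df
    return call_rate, maf, hwe_p

def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.05, min_hwe_p=1e-4):
    """Apply illustrative thresholds (assumed values, not the study's cutoffs)."""
    call_rate, maf, hwe_p = snp_qc(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p

# Genotype counts of rs2843403 as reported later in the slides: CC 112, TC 108, TT 28.
example = [0] * 112 + [1] * 108 + [2] * 28
print(snp_qc(example), passes_qc(example))
```

In practice the HWE check is usually run on controls only, since a true association can distort genotype proportions among cases.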

Statistical approach: logistic regression

  • A logistic regression models the log-odds of observing a given phenotype for person \(i\) as a function of a baseline log-odds (the intercept \(\beta_0\)), an additive measure of their genotype \(g_i\), and a vector of potential confounders \(x_i\).

\[ \begin{align} \log\Bigg(\frac{P(\text{Phenotype}_i)}{1-P(\text{Phenotype}_i)} \Bigg) &= \beta_0 + \beta_1 g_i + \gamma^Tx_i \end{align} \]

  • The null hypothesis of no genotypic effect is \(H_0: \hspace{2mm} \beta_1=0\).
  • We control for two potential confounders: gender and population structure.
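
To make the model concrete, the following sketch fits the logistic regression for one SNP by Newton-Raphson and computes a Wald test of \(H_0: \beta_1 = 0\). It is a simplified illustration under stated assumptions: the confounder term \(\gamma^T x_i\) is omitted to keep the algebra to a 2x2 system, the function name `fit_logistic` is hypothetical, and the toy data are invented.

```python
from math import erfc, exp, sqrt

def fit_logistic(g, y, n_iter=25):
    """Fit logit P(y_i = 1) = b0 + b1 * g_i by Newton-Raphson.

    g: genotype codes, y: phenotype (1 = case, 0 = control).
    Returns (b0, b1, se_b1, wald_p). Confounders are left out of this sketch.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        s0 = s1 = 0.0              # score vector (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0      # Fisher information matrix entries
        for gi, yi in zip(g, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * gi)))
            w = p * (1.0 - p)
            s0 += yi - p
            s1 += (yi - p) * gi
            h00 += w
            h01 += w * gi
            h11 += w * gi * gi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + I^{-1} * score, with the 2x2 inverse written out.
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    # Standard error from the information at the last iterate before the final
    # step; after convergence the difference is negligible.
    se_b1 = sqrt(h00 / det)
    z = b1 / se_b1
    wald_p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return b0, b1, se_b1, wald_p

# Invented toy data: 40 non-carriers (10 cases) and 40 carriers (20 cases).
g = [0] * 40 + [1] * 40
y = [1] * 10 + [0] * 30 + [1] * 20 + [0] * 20
print(fit_logistic(g, y))
```

With real data one would normally use a statistics library and include the gender and population-structure covariates in the design matrix.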

Statistical approach: logistic regression

  • Our baseline \(g_i\) assumes an additive effect
  • \(e^{\beta_1}\) is the odds ratio per additional copy of the minor allele, which approximates the genotype relative risk (GRR) when the phenotype is rare

  • For example, SNP rs2843403 has genotype counts CC (112), TC (108), and TT (28), so:

\[ \begin{align} g_i^{\text{additive}} &= \begin{cases} 0 & \text{if } \text{CC} \\ 1 & \text{if } \text{TC} \\ 2 & \text{if } \text{TT} \end{cases} \hspace{3mm} \text{ and } \hspace{3mm} g_i^{\text{dominant}} = \begin{cases} 0 & \text{if } \text{CC} \\ 1 & \text{if } \text{TC} \\ 1 & \text{if } \text{TT} \end{cases} \hspace{3mm} \end{align} \]
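
The two codings above amount to counting copies of the minor allele (additive) or flagging whether any copy is present (dominant). A minimal sketch, where the function name `encode_genotype` and its arguments are hypothetical:

```python
def encode_genotype(genotype, minor="T", model="additive"):
    """Code a two-letter genotype string such as "TC" for the regression.

    additive: number of minor alleles (0, 1 or 2)
    dominant: 1 if at least one minor allele is present, else 0
    """
    count = genotype.count(minor)
    if model == "additive":
        return count
    if model == "dominant":
        return 1 if count > 0 else 0
    raise ValueError(f"unknown model: {model!r}")

for gt in ("CC", "TC", "TT"):
    print(gt, encode_genotype(gt), encode_genotype(gt, model="dominant"))
# CC 0 0
# TC 1 1
# TT 2 1
```

For rs2843403 the T allele is the minor allele (164 of 496 alleles), which is why it is the one being counted.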

Adjusting for multiple hypothesis testing

  • The family-wise error rate (FWER) is the probability of making at least one false positive across all tests.
  • Adjustments: "conservative" Bonferroni or Monte Carlo simulations
  • Permutation testing, which shuffles the phenotype labels:
    • We use 5000 runs
    • Generates an empirical null distribution of p-values
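
The two adjustments can be sketched as follows. This is an illustration rather than the analysis code: the function names are hypothetical, the permutation scheme shown is a max-statistic variant (one common way to control the FWER), and the absolute difference in mean genotype between cases and controls stands in for the logistic regression test statistic.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the FWER at or below alpha."""
    return alpha / n_tests

def mean_diff(genos, labels):
    """Stand-in test statistic: |mean genotype in cases - mean in controls|."""
    cases = [g for g, lab in zip(genos, labels) if lab == 1]
    controls = [g for g, lab in zip(genos, labels) if lab == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_adjusted_p(snps, labels, n_perm=5000, seed=0):
    """FWER-adjusted p-values from a max-statistic permutation test.

    snps: list of per-SNP genotype lists; labels: 1 = case, 0 = control.
    """
    rng = random.Random(seed)
    observed = [mean_diff(s, labels) for s in snps]
    shuffled = list(labels)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        max_null.append(max(mean_diff(s, shuffled) for s in snps))
    # Share of permutations whose largest statistic reaches the observed one;
    # the +1 terms count the observed labelling itself.
    return [(1 + sum(m >= obs for m in max_null)) / (n_perm + 1)
            for obs in observed]

# Invented toy data: SNP 0 separates cases from controls, SNP 1 does not.
labels = [1] * 10 + [0] * 10
snps = [[2] * 10 + [0] * 10, [0, 1] * 10]
print(permutation_adjusted_p(snps, labels, n_perm=200))
print(bonferroni_threshold(0.05, 6424 - 493))  # SNPs remaining after QA
```

Because the permutations preserve the correlation between SNPs, this approach is typically less conservative than Bonferroni when markers are in linkage disequilibrium.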

Manhattan plot for T2D

Odds ratios for additive and dominant model

  • The figures below show the estimated effect size for the two SNPs of interest. While the p-value measures statistical significance, it does not measure biological significance.
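
An effect size on the odds-ratio scale is obtained by exponentiating the regression coefficient. The sketch below converts a coefficient and its standard error into an odds ratio with a Wald 95% confidence interval; the function name `odds_ratio_ci` and the input values are hypothetical and are not estimates from this study.

```python
from math import exp

def odds_ratio_ci(beta, se, copies=1, z=1.96):
    """Odds ratio and Wald confidence interval for `copies` minor alleles.

    Under the additive coding the log-odds change linearly with allele count,
    so the odds ratio for two copies is exp(2 * beta).
    """
    estimate = exp(copies * beta)
    lower = exp(copies * (beta - z * se))
    upper = exp(copies * (beta + z * se))
    return estimate, lower, upper

# Hypothetical coefficient and standard error, for illustration only.
print(odds_ratio_ci(0.5, 0.2))
print(odds_ratio_ci(0.5, 0.2, copies=2))
```

An interval that excludes 1 corresponds to a Wald test that rejects \(\beta_1 = 0\) at the matching significance level.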

SNPs of interest

  • rs10008252 \(\to\) SNORD65 (upstream gene variant)
    • Found on chromosome 4
    • No association to a phenotype
    • SNORD: Small nucleolar RNA whose primary function is to guide chemical modifications of other RNAs
  • rs1045971 \(\to\) C16orf46 (aka FLJ32702 gene)
    • Found on chromosome 16
    • Codes for an uncharacterized protein
    • Microarray data show expression is most prevalent in testis tissue (ADHD study)
    • FLJ32702 interacts with FAT atypical cadherin 3 protein

Limitations

  • Only one cohort
  • T2D is associated with common variants of small effect \(\to\) less power
  • Small sample size \(\to\) less power
  • Accounting for interactions within the genome and the environment:
    • GWAS does not consider the genetic interaction between loci (epistasis)
    • Nor the interaction(s) between loci and the environment

Next steps

  • Increase sample size: \(>1000\)
  • Incorporate NGS techniques to discover less common and rare variation
  • Develop informatics methods that examine epistasis and gene-environment interactions
  • Meta-analysis with multiple cohorts
  • Following meta-analysis, translate findings to the clinical setting using risk assessment tools for candidate genes \(\to\) gene-specific therapies

References

Anderson, C. et al. 2010. “Data Quality Control in Genetic Case-Control Association Studies.” Nature Protocols 5 (9): 1564–73. doi:10.1038/nprot.2010.116.

Laurie, C. et al. 2010. “Quality Control and Quality Assurance in Genotypic Data for Genome-Wide Association Studies.” Genetic Epidemiology 34 (6): 591–602. doi:10.1002/gepi.20516.