October 26, 2018

Outline

  • Test statistics
  • Allele coding
  • Power & resolution
  • Linkage mapping
  • LD mapping
  • Structure
  • Imputation
  • GLM
  • MLM
  • WGR
  • Rare variants
  • Validation studies

Test statistics

  • Testing associations can be as simple as a t-test or an ANOVA

Test statistics

  • A more generalized framework: Likelihood test

\[LRT = -2\ln(L_0 / L_1) = 2(\log L_1 - \log L_0)\]

For the model:

\[y=Xb+Zu+e\\ y\sim N(Xb,V)\]

The REML log-likelihood function is given by:

\[L(\sigma^2_u,\sigma^2_e)=-0.5( ln|V|+ln|X'V^{-1}X|+y'Py)\]

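Where \(V=ZKZ'\sigma^2_u+I\sigma^2_e\), \(P=V^{-1}-V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}\) and \(y'Py=y'\hat{e}/\sigma^2_e\), with \(\hat{e}\) the residuals of the fitted model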
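A minimal sketch of the likelihood-ratio test, using the usual convention \(LRT = 2(\log L_1 - \log L_0)\) and assuming one extra parameter in the full model (1 df). The log-likelihood values are hypothetical, chosen only for illustration.

```python
import math

def lrt_statistic(loglik_null, loglik_full):
    """Likelihood-ratio statistic: 2 * (logL1 - logL0)."""
    return 2.0 * (loglik_full - loglik_null)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square with 1 df.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods, for illustration only.
logL0 = -1234.5   # null model (no marker)
logL1 = -1226.0   # full model (with marker)

lrt = lrt_statistic(logL0, logL1)
p_value = chi2_sf_1df(lrt)
print(f"LRT = {lrt:.2f}, p = {p_value:.2e}")
```

The statistic is non-negative whenever the full model fits at least as well as the null, and larger values mean stronger evidence for the marker.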

Allele coding

Types of allele coding

  1. Add. (1 df): {-1,0,1} or {0,1,2} - Very popular (Lines, GCA)
  2. Dom. (1 df): {0,1,0} - Popular (Trees, clonals and Hybrids)
  3. Jointly A+D (2 df): Popular for QTL mapping in F2s
  4. Complete dominance (1 df): {0,0,1} or {0,1,1} - Very unusual
  5. Interactions (X df): Marker x Factor (epistasis and GxE)
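The codings above can be sketched as lookup tables. This assumes genotypes are stored as counts of the alternate allele (0, 1 or 2); the function names are illustrative, not from any package.

```python
# Genotypes are counts of the alternate allele: 0, 1 or 2 (an assumption).
ADDITIVE = {0: -1, 1: 0, 2: 1}       # {-1,0,1}; {0,1,2} differs only by a shift
DOMINANCE = {0: 0, 1: 1, 2: 0}       # heterozygote indicator {0,1,0}
COMPLETE_DOM = {0: 0, 1: 1, 2: 1}    # {0,1,1}

def encode(genotypes, scheme):
    """Recode a list of genotype counts with one of the tables above (1 df)."""
    return [scheme[g] for g in genotypes]

def encode_additive_dominance(genotypes):
    """Joint A+D coding: two columns per marker (2 df)."""
    return [(ADDITIVE[g], DOMINANCE[g]) for g in genotypes]

def encode_interaction(genotypes, factor):
    """Marker x factor column: additive code times a numeric covariate."""
    return [ADDITIVE[g] * f for g, f in zip(genotypes, factor)]

calls = [0, 1, 2, 2, 1]
print(encode(calls, ADDITIVE))
print(encode(calls, DOMINANCE))
print(encode_additive_dominance(calls))
```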

Power and resolution

Power

  • Key: Number of individuals & allele frequency
  • More DF = lower power
  • Multiple testing: Bonferroni and FDR
  • Tradeoff: Power vs false positives

Resolution

  • Genotyping density
  • LD blocks
  • Recombination

Power: Variance of X
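A small sketch of how allele frequency drives the variance of the marker covariate. It assumes Hardy-Weinberg proportions and the additive {0,1,2} coding, in which case \(Var(x) = 2p(1-p)\).

```python
def genotype_variance_hwe(p):
    """Variance of the additive code {0,1,2} under Hardy-Weinberg: 2p(1-p)."""
    return 2.0 * p * (1.0 - p)

def variance_from_frequencies(p):
    """Same quantity computed from genotype frequencies, as a check."""
    q = 1.0 - p
    freqs = {0: q * q, 1: 2.0 * p * q, 2: p * p}
    mean = sum(x * f for x, f in freqs.items())
    return sum(f * (x - mean) ** 2 for x, f in freqs.items())

for maf in (0.01, 0.05, 0.10, 0.25, 0.50):
    print(f"MAF = {maf:.2f}  Var(x) = {genotype_variance_hwe(maf):.4f}")
```

In a simple regression the sampling variance of the estimated marker effect is roughly \(\sigma^2_e / (n\,Var(x))\), so a low-frequency allele needs more individuals to reach the same precision.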

Beavis effect: QTL effects are overestimated in small populations; 1000 individuals is just OK

Multiple testing:

GWAS tests \(m\) hypotheses:

  • No correction: \(\alpha = 0.05\)
  • Bonferroni: \(\alpha = 0.05/m\)
  • FDR (25%): \(\alpha = 0.05/(m\times0.75)\)
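A sketch of the per-test thresholds for a scan of \(m\) markers. `slide_fdr_threshold` reproduces the slide's formula as written; `benjamini_hochberg` is the standard step-up FDR procedure, added here as an alternative and not taken from the slides. The marker count and p-values are hypothetical.

```python
import math

def bonferroni_threshold(m, alpha=0.05):
    """Per-test significance level under Bonferroni: alpha / m."""
    return alpha / m

def slide_fdr_threshold(m, alpha=0.05, fdr=0.25):
    """The slide's relaxed threshold: alpha / (m * (1 - fdr))."""
    return alpha / (m * (1.0 - fdr))

def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

m = 10000  # hypothetical number of markers
for name, thr in (("Bonferroni", bonferroni_threshold(m)),
                  ("Slide FDR (25%)", slide_fdr_threshold(m))):
    print(f"{name}: alpha = {thr:.2e}, -log10 = {-math.log10(thr):.2f}")

# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))
```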

Linkage mapping

  • Generally on experimental pops (F2, DH, RIL, BC)
  • Based on single-marker analysis or interval mapping
  • Recombination rates would increase power

LD mapping (or association mapping)

  • Use of historical LD between marker and QTL
  • Association mapping (AM) allows studies on random panels
  • Dense SNP panels would increase resolution

Structure

  1. Confounding associations with sub-populations
  2. Major limitation of association mapping
  3. Structure: PCs, clusters, subpopulations (e.g. race)

Imputation

Fewer missing values = more observations = more detection power

  • Markov models: Based on flanking markers
  • Random forest: Multiple decision trees capture LD
  • kNN & Projections: Fill with similar haplotypes
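A minimal sketch of the kNN idea, assuming unphased genotype calls coded 0/1/2 with `None` for missing values (real tools typically work on phased haplotypes). The function name and the toy panel are hypothetical.

```python
def knn_impute(genotypes, k=3):
    """Fill missing calls (None) using the k most similar individuals.

    genotypes: list of rows (individuals), each a list of 0/1/2 or None.
    Distance: mean absolute difference over markers observed in both rows.
    """
    imputed = [row[:] for row in genotypes]

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(abs(x - y) for x, y in shared) / len(shared)

    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            # Donors: other individuals with an observed call at marker j.
            donors = [(distance(row, other), other[j])
                      for h, other in enumerate(genotypes)
                      if h != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [g for _, g in donors[:k]]
            if nearest:
                # Average the neighbours' calls and round to a genotype.
                imputed[i][j] = round(sum(nearest) / len(nearest))
    return imputed

panel = [
    [0, 0, 2, 2],
    [0, 0, 2, None],   # missing call to be filled
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 2],
]
filled = knn_impute(panel, k=2)
print(filled[1])
```

The missing call is filled from the two individuals that match row 1 at every observed marker, which is how similar haplotypes carry LD information into the gap.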

GLM (generalized linear models)

  • Full model (\(L_1\)): \[y = Xb + m_ja + e\]
  • Null model (\(L_0\)): \[y = Xb + e\]
  1. Advantage: Fast, not restricted to Gaussian traits
  2. Popular methodology on human genetic studies
  3. \(Xb\) includes (1) environment, (2) structure and (3) covariates
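A sketch of the full-versus-null comparison for one marker, under simplifying assumptions: a Gaussian trait and \(Xb\) reduced to an intercept. For that case the likelihood-ratio statistic is \(n\ln(RSS_0/RSS_1)\). The phenotypes and marker vectors are made up for illustration.

```python
import math

def residual_ss(y, x=None):
    """Residual sum of squares for y ~ 1 (x is None) or y ~ 1 + x."""
    n = len(y)
    y_bar = sum(y) / n
    ss_y = sum((v - y_bar) ** 2 for v in y)
    if x is None:
        return ss_y
    x_bar = sum(x) / n
    ss_x = sum((v - x_bar) ** 2 for v in x)
    if ss_x == 0.0:           # monomorphic marker: nothing to estimate
        return ss_y
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return ss_y - s_xy ** 2 / ss_x

def single_marker_lrt(y, marker):
    """Gaussian LRT of y = mu + m*a + e against y = mu + e (1 df)."""
    n = len(y)
    rss0 = residual_ss(y)            # null model
    rss1 = residual_ss(y, marker)    # full model
    lrt = max(0.0, n * math.log(rss0 / rss1))
    p_value = math.erfc(math.sqrt(lrt / 2.0))   # chi-square tail, 1 df
    return lrt, p_value

# Hypothetical data: nine individuals, two markers.
y = [10.1, 9.8, 10.4, 11.0, 11.3, 10.9, 12.1, 12.4, 11.8]
linked = [0, 0, 0, 1, 1, 1, 2, 2, 2]
unlinked = [1, 2, 0, 0, 1, 2, 2, 0, 1]

lrt_linked, p_linked = single_marker_lrt(y, linked)
lrt_unlinked, p_unlinked = single_marker_lrt(y, unlinked)
print(f"linked:   LRT = {lrt_linked:.2f}, p = {p_linked:.2e}")
print(f"unlinked: LRT = {lrt_unlinked:.2f}, p = {p_unlinked:.2e}")
```

The chi-square approximation is asymptotic; with samples this small an F-test would be the safer choice.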

MLM (mixed linear models)

  • Full model (\(L_1\)): \[y = Xb + Zu + m_ja + e\]
  • Null model (\(L_0\)): \[y = Xb + Zu + e\]
  1. The famous "Q+K model"
  2. Advantage: Better control of false positives, no need for PCs
  3. Polygenic effect (\(u\)) assumes \(u\sim N(0,K\sigma^2_u)\)
  4. Faster if we don't reestimate \(\lambda = \sigma^2_e/\sigma^2_u\) for each SNP
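A sketch of why fixing \(\lambda\) saves time; the symbols \(H\), \(W_j\) and \(\hat\theta_j\) are introduced here for illustration. With \(\lambda\) fixed, \(V=\sigma^2_u H\) where \(H = ZKZ' + \lambda I\) is the same for every marker, so each test reduces to a generalized least-squares fit with \(W_j = [X, m_j]\):

\[\hat\theta_j = (W_j'H^{-1}W_j)^{-1}W_j'H^{-1}y \qquad Var(\hat\theta_j) = \sigma^2_u (W_j'H^{-1}W_j)^{-1}\]

The marker effect \(\hat{a}_j\) is the last element of \(\hat\theta_j\) and can be tested with a Wald statistic \(\hat{a}_j^2/Var(\hat{a}_j)\). \(H^{-1}\) (or its eigendecomposition) is computed once and reused across SNPs.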

cMLM (compressed MLM)

  1. Uses the same base model as MLM
  2. Advantage: Faster than MLM
  3. Based on clustered individuals:
  • \(Z\) indicates the subgroup
  • \(K\) is the relationship among subgroups
  • Often needs PCs to complement \(K\)

WGR (whole-genome regression)

  1. Tests all markers at once
  2. Advantage: No double-fitting, no PCs, no Bonferroni
  • Model (BayesB, BayesC, SSVS): \[y = Xb + Ma + e\]
  • Marker effects are from a mixture of distributions

Each marker has a Bernoulli inclusion indicator: \(a_j = 0\) with probability \(\pi\), and \(a_j \neq 0\) with probability \(1-\pi\)
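A small sketch of drawing marker effects from such a mixture: zero with probability \(\pi\), otherwise a draw from a continuous distribution. A Gaussian slab is assumed here for simplicity (closer to BayesC; BayesB uses marker-specific variances). The function name and parameter values are hypothetical.

```python
import random

def draw_marker_effects(n_markers, pi=0.95, sd=1.0, seed=1):
    """Sample effects from a point-mass-at-zero / Gaussian mixture."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_markers):
        if rng.random() < pi:
            effects.append(0.0)                  # marker excluded from the model
        else:
            effects.append(rng.gauss(0.0, sd))   # marker carries an effect
    return effects

effects = draw_marker_effects(10000, pi=0.95)
share_nonzero = sum(e != 0.0 for e in effects) / len(effects)
print(f"share of markers with an effect: {share_nonzero:.3f}")
```

In a fitted model the posterior probability that a marker is included plays the role of the test statistic, which is why no separate multiple-testing correction is applied.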

Rare variants

  1. Screen a set (\(s\)) of low MAF markers on NGS data
  2. Advantage: Detect signals from low power SNPs
  3. Applied to uncommon diseases (not seen in plant breeding)
  4. Two possible models
  • Full model 1 (\(L_1\)): \(y = Xb + M_sa + e\)
  • Full model 2 (\(L_2\)): \(y = Xb + PC_1(M_s) + e\)
  • Null model (\(L_0\)): \(y = Xb + e\)

Test either \(LR(L_1,L_0)\) or \(LR(L_2,L_0)\)
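As a rough guide to the two tests (an asymptotic approximation, assuming the \(s\) columns of \(M_s\) are linearly independent and \(PC_1(M_s)\) is treated as a fixed covariate), the null distributions differ in degrees of freedom:

\[LR(L_1,L_0) \sim \chi^2_{s} \qquad LR(L_2,L_0) \sim \chi^2_{1}\]

Model 2 spends a single degree of freedom on the whole set, which ties back to the earlier point that more DF means lower power.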

Validation studies

  • QTLs detected with 3 methods, across 3 mapping pops
  • Validations made on 3 unrelated populations