Candidate Gene Search Tutorial

JT Lovell

2016-11-01

Part 1: Overview

To search for candidate genes you need four objects.

  1. gff - the gene model position dataset
  2. markerBp - the basepair position of markers
  3. cross - the QTL cross object used to identify QTL
  4. interval - the numeric confidence interval of the QTL (chr, lower ci, upper ci) ##### With these data, one can cull lists of genes to those within a QTL interval

To infer the potential of a candidate gene you need at least one of 7 datasets

  1. vcf - the polymorphisms between parents, in “vcf” format. It is optimal to have this annotated by snpEff or similar.
  2. parentGeneExp - results of differential expression analysis between parents
  3. cisEQtl - a list of genes with cis-eQTL
  4. methyl - dataset containing the degree of methylation for each gene
  5. geneDescr - Gene descriptions
  6. GO - GO annotations
  7. geneExp - gene expression of the mapping population ##### With these data, one can infer whether a gene is likely to contain the causal QTN(s)

Part 2: Getting set up

To start, you need the qtlTools package. Get it from github.
library(devtools)
install_github("jtlovell/qtlTools")
library(qtlTools)
A few other packages are needed, install these if you don’t have them
library(qtlpvl)
library(data.table)
Load the multitrait data from R/qtl
data(multitrait)
Create some fake physical positions of the markers allowing for low recombination in the middle of the chromosomes (as would be expected in the pericentromeric region)
map<-pullMap(multitrait)
map$bp<-0
for(i in unique(map$chr)){
  n<-sum(map$chr==i)
  p<-sin((1:n/n)*pi)
  map$bp[map$chr==i]<-cumsum(p*1000000)
}
Create a fake gff file
gff<-data.frame(chr = rep(paste0("scaffold_",1:5),each = 200),
   feature = rep("gene",1000),
   start = rep(seq(from = 0, to = max(map$bp), length = 200), 5),
   end = rep(seq(from = 0, to = max(map$bp), length = 200))+1000,
   strand = rep("+",1000),
   attribute = paste0("gene",1:1000,";","gene",1:1000,".1"), stringsAsFactors=F)

Part 3: Infer the physical position of the genes, using the position of the markers

geneCM<-findGenecM(cross = multitrait, marker.info = map, gff = gff,
   gffCols = c("chr","feature","start","end","strand","attribute"))
## parsing attributes column of gff file
## culling chromosomes to those in the cross
## inferring mapping position for:
## chr 1 (n. features = 200)
## chr 2 (n. features = 200)
## chr 3 (n. features = 200)
## chr 4 (n. features = 200)
## chr 5 (n. features = 200)
Plots showing the bp/cM patterns
par(mfrow=c(3,2))
for(i in unique(map$chr)){
  with(geneCM[geneCM$chr==i,], plot(pos,bp, col="grey", 
                                main = "cM and bp positions of genes and markers",
                                ylab = "physical position (bp)",
                                xlab = "mapping position (cM)"))
  with(map[map$chr==i,], points(pos,bp, col=i, pch = 19, cex=.8))
}

Part 4: Find genes in the interval

Make qtl intervals
s1<-scanone(multitrait, method="hk", pheno.col=1)
perm<-scanone(multitrait, n.perm=100, method="hk",pheno.col=1, verbose=FALSE)
cis<-calcCis(cross = multitrait, s1.output=s1, perm.output=perm, drop=5)

par(mfrow = c(1,1))
plot(s1)
segmentsOnPeaks(multitrait, s1.output=s1, calcCisOutput = cis, int.y = 13.1)

Pull out genes in the intervals
candGenes<-findGenesInterval(findGenecM.output = geneCM, calcCis.output = cis)
print(candGenes)
## $`4@9`
##  [1] "gene601" "gene602" "gene603" "gene604" "gene605" "gene606" "gene607"
##  [8] "gene608" "gene609" "gene610" "gene611" "gene612" "gene613" "gene614"
## [15] "gene615" "gene616" "gene617" "gene618" "gene619" "gene620" "gene621"
## [22] "gene622" "gene623" "gene624" "gene625" "gene626" "gene627" "gene628"
## 
## $`5@35`
##  [1] "gene854" "gene855" "gene856" "gene857" "gene858" "gene859" "gene860"
##  [8] "gene861" "gene862" "gene863" "gene864" "gene865" "gene866" "gene867"
## [15] "gene868" "gene869" "gene870" "gene871" "gene872" "gene873" "gene874"
## [22] "gene875" "gene876" "gene877" "gene878" "gene879" "gene880" "gene881"
## [29] "gene882" "gene883" "gene884" "gene885"

Part 5: next steps

There are a number of approaches to define how likely any gene is to be the candidate.
  1. Genes with non-synonymous SNPs
  2. Genes with cis-eQTL (Lowry et al. 2013, Plant Cell)
  3. Genes with annotations similar to the trait of interest
  4. Covariance of expression and trait of interest in mapping population (Lovell et al. 2015, Plant Cell)
  5. Causal Inference testing