3. An HMM to infer recombination events

Defining the problem we want to solve

We want to identify recombination events in the progeny of a biparental cross for which we have phased and genotyped parents and non-phased, genotyped progeny.

In the terms of our simulated chromatids, we have the parental and maternal genotypes for each of both child chromatids of a diploid individual in the progeny. We also have the combined genotype of both chromatids in the individual of the progeny but not its phasing.

What we want is to recognize parental segments (out of 4 parental chromatides) that contributed to the final genotype in the individual of the progeny, that is the parent.path for each of the chromatids in the the child.

Using our chromatid simulator:

child1.chromatid.mom<-SimulateChromatid()
child1.chromatid.dad<-SimulateChromatid()
child1.chromatid.mom["child"]

## $child
##  [1] 1 1 1 0 0 0 1 0 1 0

child1.chromatid.dad["child"]

## $child
##  [1] 0 0 0 0 1 1 0 1 0 0

Then our diploid “child” genotype would be. (Remember, 0 for homozygous reference, 1 for heterozygous and 2 for homozygous alternate allele).

child1.genotype<-child1.chromatid.mom$child+child1.chromatid.dad$child
child1.genotype

##  [1] 1 1 1 0 1 1 1 1 1 0

And we still have acces to the phased parent genotype.

child1.chromatid.mom$maternal

##  [1] 0 1 1 1 0 0 1 1 1 1

child1.chromatid.mom$paternal

##  [1] 1 1 1 0 1 0 1 0 1 0

child1.chromatid.dad$maternal

##  [1] 0 0 0 1 1 1 1 1 0 1

child1.chromatid.dad$paternal

##  [1] 0 0 0 0 1 1 0 1 0 0

Now using the diploid child genotype and parental phased chromatid genotypes we can infer the parent.path with a Hidden Markov Model (HMM), and compare our inference to the actual parent.paths we have stored from the simulation.

child1.chromatid.mom$parent.path

##  [1] "paternal" "paternal" "paternal" "paternal" "maternal" "maternal"
##  [7] "paternal" "paternal" "maternal" "paternal"

child1.chromatid.dad$parent.path

##  [1] "maternal" "maternal" "paternal" "paternal" "paternal" "paternal"
##  [7] "paternal" "paternal" "paternal" "paternal"

Defining our HMM

lael

According to the diagram, our observed states O_i are dependent on our hidden states G_i. Each hidden state G_i+1 is dependent on the previous G_i state.

Our observations fo the child genotype, O₁, O₂, …, O_n will be one of three possible values or states:

0, 1 or 2.

Our hidden states G₁, G₂, …, G_n will be one of four possible sates:

mom.maternal/dad.maternal, mom.maternal/dad.paternal, mom.paternal/dad.maternal, mom.paternal/dad.paternal

Having the observed and hidden states of our model defined now we need to define the transition and emission probability matrix.

Further defining our HMM, transition and emission probability matrixes

The transition probability matrix has the probabilities of getting certain state in G_i+1 given the state of hidden state G_i.

This will be our transition probability matrix, with r being the recombination rate.

x	mom.maternal/dad.maternal	mom.maternal/dad.paternal	mom.paternal/dad.maternal	mom.paternal/dad.paternal
mom.maternal/dad.maternal	(1-r)(1-r)	(1-r)r	(1-r)r	r*r
mom.maternal/dad.paternal	(1-r)r	(1-r)(1-r)	r*r	(1-r)r
mom.paternal/dad.maternal	(1-r)r	r*r	(1-r)(1-r)	(1-r)r
mom.paternal/dad.paternal	r*r	(1-r)r	(1-r)r	(1-r)(1-r)

It woud be good to have a function to create this transition matrix given a recombination rate.

GetTransitionProb<-function(r){
  # r is the recombination rate
  transition.matrix<-matrix(rep(r*(1-r),16),ncol=4);
  transition.matrix[1,1]=(1-r)^2;
  transition.matrix[2,2]=(1-r)^2;
  transition.matrix[3,3]=(1-r)^2;
  transition.matrix[4,4]=(1-r)^2;
  transition.matrix[4,1]=r^2;
  transition.matrix[3,2]=r^2;
  transition.matrix[2,3]=r^2;
  transition.matrix[1,4]=r^2;
  rownames(transition.matrix)<-c('mom.maternal/dad.maternal','mom.maternal/dad.paternal',
                                 'mom.paternal/dad.maternal','mom.paternal/dad.paternal')
  colnames(transition.matrix)<-rownames(transition.matrix)      
  return(transition.matrix)
}

GetTransitionProb(0.2)

##                           mom.maternal/dad.maternal
## mom.maternal/dad.maternal                      0.64
## mom.maternal/dad.paternal                      0.16
## mom.paternal/dad.maternal                      0.16
## mom.paternal/dad.paternal                      0.04
##                           mom.maternal/dad.paternal
## mom.maternal/dad.maternal                      0.16
## mom.maternal/dad.paternal                      0.64
## mom.paternal/dad.maternal                      0.04
## mom.paternal/dad.paternal                      0.16
##                           mom.paternal/dad.maternal
## mom.maternal/dad.maternal                      0.16
## mom.maternal/dad.paternal                      0.04
## mom.paternal/dad.maternal                      0.64
## mom.paternal/dad.paternal                      0.16
##                           mom.paternal/dad.paternal
## mom.maternal/dad.maternal                      0.04
## mom.maternal/dad.paternal                      0.16
## mom.paternal/dad.maternal                      0.16
## mom.paternal/dad.paternal                      0.64

The emission probability matrix has the probabilities of observing O_i given a G_i

Because we have the genotype and phase of the parents we can use that information in our emission matrix. Depending on the genotypes of the parents we will have different emmission matrixes.

At each locus we have 16 possible genotypes from the parents, the 4 parental chromatids can have 2 possible genotypes, reference or alternate (0 or 1), 2⁴=16.

x	genotype_1	genotype_2	genotype_15
mom.maternal	0	0	1
mom.paternal	0	0	1
dad.maternal	0	1	1
dad.paternal	1	0	1

The emission probability matrix for genotype_2 would look like this:

x	0	1
mom.maternal/dad.maternal	0	1
mom.maternal/dad.paternal	1	0
mom.paternal/dad.maternal	0	1
mom.paternal/dad.paternal	1	0

and for genotype_15 would look like this:

x	0	1	2
mom.maternal/dad.maternal	0	0	1
mom.maternal/dad.paternal	0	0	1
mom.paternal/dad.maternal	0	0	1
mom.paternal/dad.paternal	0	0	1

Now, we can make a function to generate those matrixes given a set of maternal and paternal genotypes.

GetEmissionProb<-function(mom.maternal,mom.paternal,dad.maternal,dad.paternal){
  row.1<-c(as.numeric(mom.maternal+dad.maternal==0),
           as.numeric(mom.maternal+dad.maternal==1),
           as.numeric(mom.maternal+dad.maternal==2))
  row.2<-c(as.numeric(mom.maternal+dad.paternal==0),
           as.numeric(mom.maternal+dad.paternal==1),
           as.numeric(mom.maternal+dad.paternal==2)) 
  row.3<-c(as.numeric(mom.paternal+dad.maternal==0),
           as.numeric(mom.paternal+dad.maternal==1),
           as.numeric(mom.paternal+dad.maternal==2))
  row.4<-c(as.numeric(mom.paternal+dad.paternal==0),
           as.numeric(mom.paternal+dad.paternal==1),
           as.numeric(mom.paternal+dad.paternal==2))      
  
  emission.matrix<-rbind(row.1,row.2,row.3,row.4)
  colnames(emission.matrix)<-c('0','1','2')
  rownames(emission.matrix)<-c('mom.maternal/dad.maternal','mom.maternal/dad.paternal',
                               'mom.paternal/dad.maternal','mom.paternal/dad.paternal')
  
  return(emission.matrix)
}

Now we can test it with the alleles from genotype_0, genotype_2, and genotype_15

GetEmissionProb(0,0,0,0) #genotype_0

##                           0 1 2
## mom.maternal/dad.maternal 1 0 0
## mom.maternal/dad.paternal 1 0 0
## mom.paternal/dad.maternal 1 0 0
## mom.paternal/dad.paternal 1 0 0

GetEmissionProb(0,0,1,0) #genotype_2

##                           0 1 2
## mom.maternal/dad.maternal 0 1 0
## mom.maternal/dad.paternal 1 0 0
## mom.paternal/dad.maternal 0 1 0
## mom.paternal/dad.paternal 1 0 0

GetEmissionProb(1,1,1,1) #genotype_15

##                           0 1 2
## mom.maternal/dad.maternal 0 0 1
## mom.maternal/dad.paternal 0 0 1
## mom.paternal/dad.maternal 0 0 1
## mom.paternal/dad.paternal 0 0 1

Author

Luis Avila (lmavila@gmail.com)