We will focus on two particular problems:
For the second part of our discussion of variability in sequence context, we will turn to the problem of assessing and interpreting genomic variation.
With the term “Genomic Variation” (or variability) we refer to the qualitative and quantitative differences in the genomic sequences of members of a population.
Depending on the extent (single-nucleotide or more extended regions) we may be referring to single nucleotide variants (SNV) or copy number variants (CNV).
Depending on the frequency of the variant in the population we may distinguish between common and rare variants.
Depending on the effect a variant has on the observed phenotype we may be talking about polymorphisms (with little or no effect) and causal variants (with large, and in some cases devastating effects).
Variability is interesting in itself, but genomic variants that are associated to phenotypic properties, especially with pathological conditions, are of extreme importance in biomedicine. Below you see an example of a series of causal SNV in the human CFTR locus which is associated with Cystic Fibrosis. A single nucleotide variant called \(\Delta\)F508 is the most strongly linked variant with the occurrence of the disease.
Large-scale studies of genomic variability are now possible through the application of Whole-Genome or Whole-Exome sequencing.
Which is then followed by a series of computational steps including:
They are identified as positions in the genome with increased variability. Although this is a simplification, you may consider this like a “sliding window” F-test that takes place on every position in the genome. Everytime a nucleotide position shows increased variability beyond a given threshold, it is flagged as a possible variable region. A variant may also be not variable as long as it is different from the reference genome sequence in the same position. (This is because the reference genome, already implies an assessment of variability since the nucleotide at this position is the one most commonly occurring in the population).
Below you see an example of no variability (first two panels), a homozygous variant (second panel, all reads are different than the reference) and a heterozygous variant (fourth panel, approximately half of the reads are variable compared to the genome reference).
#### Interpretation of variants
Once a variant is identified the next steps for its interpretation may include:
It is quite easy to see how variability is assessed in a single genomic position. Let us imagine a simple case of a single nucleotide variant that we want to assess from a functional point of view. This means that we want to realize if it is: a) common or rare b) causal or not
Answering the first question is simple. We just sequence a lot of people at that particular locus and evaluate the frequency the variant.
Answering the second question is much more complex. We first need to have a strong qualitative assumption that a particular position is linked to a particular phenotype. This does not happen very often. But assuming this is the case, we can see how we would approach this question in the following figure.
Here, we compare two groups of people, one with a certain pathological phenotype (left) and one, consisting of healthy individuals (control group, right). By checking the sequence of the particular variant in each group we find 26 out of 32 patients to be heterozygous in the variant (red), while only 2 out of 32 healthy individuals have the same (C/G) genotype, the other 30 being homozygous to the common variant (C/C).
What does this mean? Is the variant likely to be associated with the condition? And if yes, how strongly? How can we assess the difference between heathy individuals and patients?
From a statistical point of view this is a very simple case of assessing a 2X2 contingency table. Ronald Fisher allegedly solved this problem to examine a colleague’s potential to realize whether cream was added before or after tea was poured into her cup. This is why we often refer to the test as Fisher’s test. This is a test of exact statistics, meaning that it may actually calculate the exact probability of a table being like the one observed, given two groups and two conditions and 64 objects distributed randomly between them.
Let’s start by creating the table in R:
genotype<-rbind(c(6,26), c(30,2))
colnames(genotype)<-c("reference", "variant")
rownames(genotype)<-c("Patients", "Healthy")
genotype
## reference variant
## Patients 6 26
## Healthy 30 2
and now let’s apply Fisher’s test on the table:
fisher.test(genotype)
##
## Fisher's Exact Test for Count Data
##
## data: genotype
## p-value = 8.151e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.001596133 0.094608419
## sample estimates:
## odds ratio
## 0.01718739
You see that the test calculates a certain quantity called “odds ratio”, for which it also calculates a 95% confidence interval. The Odds Ratio (or OR) is calculated, as its name implies, as the ratio of the two “odds” of having the variant depending on your condition.
This is shown in the table below
and for our case is equal to (a slightly adjusted) ratio of \(OR=(6/30)/(26/2)\).
Notice how the expected value of OR under randomness is equal to 1 and this is exactly the value against which the test is performed (“true odds ratio is not equal to 1”).
fisher.test(genotype)
##
## Fisher's Exact Test for Count Data
##
## data: genotype
## p-value = 8.151e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.001596133 0.094608419
## sample estimates:
## odds ratio
## 0.01718739
The significance of the OR is evident in the very small value (<0.1), the narrow confidence interval, which is very far from 1 and the p-value of \(<=10^{-10}\).
Odds Ratios are the way we report the tendency of a variant to appear more frequently in one phenotype group than another. They represent the association of that variant for the particular phenotype. In general OR values >> 1 or << 1 are highly indicative of strong association.
Odds Ratios are supported with exact p-values obtained through the application of the Fisher’s test. These p-values are thus the probability of observing a given contingency table (like the one above) given that there is no preference of Groups 1,2 for K1 or K2 categories.
The p-value that is linked with the OR is, as mentioned above, an exact probability calculated with the help of the hypergeometric distribution. (In the figure below, you should interpret the \(\binom{x}{y}\) notation as the number of combinations of \(y\) selected from \(x\).)
This is the case for one locus, but in the real world, we very rarely know the associated locus beforehand. Most of the times we are in search of a locus, or multiple loci that are associated with a condition. This is the case of the Genome Wide Association Studies, (or GWAS) in which multiple loci are tested for association with a given condition.
The result is a huge number of p-values which we often sort according to their size (transformed as \(-log10(p)\)) in plots like the one below. Because of their shape, with few positions having very high -log10(p) (and thus being highly significant) that resemble a skyscraper-skyline, we call them “Manhattan plots”.
These contain thousands of SNVs, for which we calculate the ratio of frequencies between diseased/healthy samples.
The accumulation of information of variant-condition associations allows us to have a concise view of the pathological states that are associated with genome variation and -more importantly- the way with which this happens.
We already new that some diseases, such as the thalassemias or Cystic Fibrosis are associated with only one locus. We call these diseases highly penetrant, meaning that carrying the variant is almost certain to lead to the disease. On the other side of the spectrum, low-penetrance variants are very mildly associated with a condition (that is with very low OR). Penetrance is also related to the frequency with which a condition occurs in the population. A variant may be frequent if it has small penetrance, because this allows its spread in the population. [But in some cases we can have high frequency with high penetrance. Can you think why this happens?]
The relationship between the two is shown in the diagram below for some of the most well-studied genetic diseases.
For this week’s assignment you are asked to perform a motif building and search analysis using a set of defined motif instances and custom-made R functions.
Your goal is to create a PWM with a set of instances of the GATA transcription factor, transform it into PSSM and then search it against a whole genome sequence.
You may start by downloading a set of GATA binding sequences that you will find here.
Once this is done, you can download the target sequence against which you will search the PWM here.
In order to build the PWM you need to read the GATA sequences using this custom function and then create the matrix with the PWM function. In order to use these functions you have to upload them to R using:
source("nameoffunction.R")
This means, that you may use \(readfastafile()\) as:
source("readfastafile.R")
gataseqs<-readfastafile("gata.fa")
and then obtain the PWM with:
source("seqMotif.R")
gataPWM<-seqMotif(gataseqs, drawmotif=F)
The transformation of the PWM to PSSM requires an additional random sequence collection, which you can find here. You can then create the PSSM by taking the log2(ratio) of the two PWMs (one from the GATA sequences over the one from the random sequence).
In the next step you will use a PSSMSearch Function to search for the motif in the longer target sequence. Assuming you have created the PSSM into the table called \(pssm\), the list of commands below should work:
source("pssmSearch.R")
targetset<-readfastafile("test1.fa")
pssmSearch(pssm, targetseq, threshold=X)
The parameter X in the threshold tells the search function to return a different number of successful hits for the search. Threshold can take values from 0 to 1, with 1 returning the instances of the binding sites that have score equal to the maximum (100%) of the PSSM. A value of 0.8 will return values with scores 80% or more of the maximum and thus a greater number of hits.