Introduction. The biological problem

  • In this class we will shift our attention to problems of variability in a more direct biological context. We will be discussing how sequence variability may affect gene regulation through the study of regulatory motifs, before we go on to see how genomic variability in general shapes phenotypic changes in individuals.

We will focus on two particular problems:

  • A. The study, analysis and interpretation of biological sequence motifs.
  • B. The concept of genomic variability and how it is directly related to phenotypic variation in a population.

Part B. Genomic Variation

For the second part of our discussion of variability in sequence context, we will turn to the problem of assessing and interpreting genomic variation.

What is genomic variability?

With the term “Genomic Variation” (or variability) we refer to the qualitative and quantitative differences in the genomic sequences of members of a population.

Depending on the extent (a single nucleotide or a more extended region) we may be referring to single nucleotide variants (SNVs) or copy number variants (CNVs).

Depending on the frequency of the variant in the population we may distinguish between common and rare variants.

Depending on the effect a variant has on the observed phenotype we may be talking about polymorphisms (with little or no effect) and causal variants (with large, and in some cases devastating effects).

Why is it interesting?

Variability is interesting in itself, but genomic variants that are associated with phenotypic properties, especially with pathological conditions, are of extreme importance in biomedicine. Below you see an example of a series of causal variants in the human CFTR locus, which is associated with Cystic Fibrosis. A variant called \(\Delta\)F508 (a three-nucleotide deletion that removes a phenylalanine residue) is the one most strongly linked with the occurrence of the disease.

Experimental approach

Large-scale studies of genomic variability are now possible through the application of Whole-Genome or Whole-Exome sequencing, which is then followed by a series of computational steps including:

  • the mapping of the sequences against the reference genome (remember the Suffix Trees)
  • the definition of the variants (remember ANOVA)
  • the interpretation/functional study of the variants

Single nucleotide variants

They are identified as positions in the genome with increased variability. Although this is a simplification, you may think of it as a "sliding window" F-test performed at every position in the genome: every time a nucleotide position shows variability beyond a given threshold, it is flagged as a possibly variable position. A position may also be called a variant without showing any variability among the reads, as long as it differs from the reference genome sequence at that position. (This is because the reference genome already implies an assessment of variability, since the nucleotide at each position is the one most commonly occurring in the population.)

Below you see an example of no variability (first two panels), a homozygous variant (third panel, where all reads differ from the reference) and a heterozygous variant (fourth panel, where approximately half of the reads differ from the genome reference).
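To make this more concrete, here is a minimal sketch of the idea, not a real variant caller; the 10% and 90% cut-offs are arbitrary values chosen purely for illustration:

classify.position<-function(ref.reads, alt.reads){
  # Sketch only: classify one genomic position from the counts of reads
  # matching (ref.reads) and differing from (alt.reads) the reference base.
  alt.fraction<-alt.reads/(ref.reads+alt.reads)
  if(alt.fraction<0.1) return("no variant")           # reads agree with the reference
  if(alt.fraction>0.9) return("homozygous variant")   # (nearly) all reads differ from the reference
  "heterozygous variant"                              # roughly half of the reads differ
}

classify.position(48, 2)     # "no variant"
classify.position(1, 51)     # "homozygous variant"
classify.position(23, 27)    # "heterozygous variant"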

Interpretation of variants

Once a variant is identified the next steps for its interpretation may include:

  • Filtering out known polymorphic positions
  • Identification of the genomic location of variants (genic, exonic, intronic etc.), as in the toy sketch after this list
  • Prediction of the possible impact of the variant (is it lying in conserved sequences? regulatory? coding?)
  • Cross-comparison with other sources of knowledge for the functional annotation (what could they do)
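To make the location step above more concrete, here is a toy sketch with made-up coordinates for a single gene with two exons; real annotation uses complete gene models, but the logic is the same:

gene<-c(start=1000, end=5000)                 # made-up gene boundaries
exons<-rbind(c(1000, 1500), c(4200, 5000))    # made-up exon coordinates

classify.variant<-function(pos){
  # Sketch only: place a variant position relative to the toy gene model
  if(pos<gene["start"] || pos>gene["end"]) return("intergenic")
  if(any(pos>=exons[,1] & pos<=exons[,2])) return("exonic")
  "intronic"
}

classify.variant(1200)   # "exonic"
classify.variant(2000)   # "intronic"
classify.variant(9000)   # "intergenic"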

Assessment of Genomic Variability

It is quite easy to see how variability is assessed in a single genomic position. Let us imagine a simple case of a single nucleotide variant that we want to assess from a functional point of view. This means that we want to determine whether it is: a) common or rare, b) causal or not.

Answering the first question is simple. We just sequence a lot of people at that particular locus and evaluate the frequency of the variant.
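As a toy example (the counts below are made up for illustration), the frequency of the variant allele can be computed directly from the genotype counts:

genos<-c(CC=950, CG=48, GG=2)    # made-up genotype counts for 1,000 individuals
(genos["CG"]+2*genos["GG"])/(2*sum(genos))   # frequency of the G allele, here 0.026 (2.6%)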

Answering the second question is much more complex. We first need to have a strong qualitative assumption that a particular position is linked to a particular phenotype. This does not happen very often. But assuming this is the case, we can see how we would approach this question in the following figure.

Here, we compare two groups of people: one with a certain pathological phenotype (left) and one consisting of healthy individuals (control group, right). By checking the sequence of the particular variant in each group we find 26 out of 32 patients to be heterozygous for the variant (red), while only 2 out of 32 healthy individuals have the same (C/G) genotype, the other 30 being homozygous for the common allele (C/C).

What does this mean? Is the variant likely to be associated with the condition? And if yes, how strongly? How can we assess the difference between healthy individuals and patients?

Statistics Interlude. Odds Ratios

From a statistical point of view this is a very simple case of assessing a 2x2 contingency table. Ronald Fisher famously devised this kind of test to examine a colleague's claim that she could tell whether the milk had been added before or after the tea was poured into her cup. This is why we often refer to it as Fisher's (exact) test. It is a test of exact statistics, meaning that it actually calculates the exact probability of obtaining a table like the one observed, given two groups, two conditions and 64 individuals distributed randomly between them.

Let’s start by creating the table in R:

genotype<-rbind(c(6,26), c(30,2))
colnames(genotype)<-c("reference", "variant")
rownames(genotype)<-c("Patients", "Healthy")
genotype
##          reference variant
## Patients         6      26
## Healthy         30       2

and now let’s apply Fisher’s test on the table:

fisher.test(genotype)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  genotype
## p-value = 8.151e-10
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.001596133 0.094608419
## sample estimates:
## odds ratio 
## 0.01718739

You see that the test calculates a certain quantity called “odds ratio”, for which it also calculates a 95% confidence interval. The Odds Ratio (or OR) is calculated, as its name implies, as the ratio of the two “odds” of having the variant depending on your condition.

This is shown in the table below

and for our case is (a slightly adjusted version of) the ratio \(OR=(6/30)/(26/2)\).
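We can verify the raw cross-product ratio directly in R; the value reported by fisher.test (0.0172) is slightly different because R returns the conditional maximum-likelihood estimate of the OR rather than the raw sample ratio, which is what "slightly adjusted" refers to:

(6/30)/(26/2)    # raw (unadjusted) odds ratio straight from the table
## [1] 0.01538462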

Notice that under the null hypothesis of no association the expected value of the OR is equal to 1, and this is exactly the value against which the test is performed ("true odds ratio is not equal to 1").


The strength of the association is evident in the very small OR (< 0.1), the confidence interval that lies entirely far below 1, and the p-value of approximately \(8\times10^{-10}\).

Odds Ratios are the way we report the tendency of a variant to appear more frequently in one phenotype group than in another. They represent the association of that variant with the particular phenotype. In general, OR values >> 1 or << 1 are highly indicative of strong association.

Odds Ratios are accompanied by exact p-values obtained through the application of Fisher's test. These p-values are the probability of observing a contingency table at least as extreme as the given one (like the one above), under the assumption that Groups 1 and 2 have no preference for the K1 or K2 categories.

The p-value that is linked with the OR is, as mentioned above, an exact probability calculated with the help of the hypergeometric distribution. (In the figure below, you should interpret the \(\binom{x}{y}\) notation as the number of combinations of \(y\) selected from \(x\).)
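As a sketch of the idea (not the exact internal implementation of fisher.test), the two-sided exact p-value can be reconstructed in R from the hypergeometric distribution, keeping the margins of our table fixed (36 "reference" and 28 "variant" individuals, of which 32 are patients):

p.all<-dhyper(0:32, m=36, n=28, k=32)   # probability of every possible table
p.obs<-dhyper(6, m=36, n=28, k=32)      # probability of the observed table
sum(p.all[p.all<=p.obs])                # two-sided exact p-value, ~8e-10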

Complexity of Statistics Interlude. Multiple loci

This is the case for one locus, but in the real world we very rarely know the associated locus beforehand. Most of the time we are in search of a locus, or multiple loci, that are associated with a condition. This is the case in Genome-Wide Association Studies (or GWAS), in which multiple loci are tested for association with a given condition.

The result is a huge number of p-values, which we often visualize (transformed as \(-\log_{10}(p)\)) along their genomic position in plots like the one below. Because of their shape, with a few positions having very high \(-\log_{10}(p)\) values (and thus being highly significant) that resemble a skyscraper skyline, we call them "Manhattan plots".

These contain thousands of SNVs, for which we calculate the ratio of frequencies between diseased/healthy samples.
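As a rough illustration of how such a plot is drawn (using simulated p-values, not real GWAS data), a minimal Manhattan-style plot can be produced like this:

set.seed(1)
p<-runif(10000)                  # simulated p-values: most SNVs show no association
p[sample(10000, 5)]<-1e-12       # a few strongly associated positions
plot(-log10(p), pch=20, cex=0.4, xlab="SNV index (genomic position)", ylab="-log10(p)")
abline(h=-log10(5e-8), col="red", lty=2)   # a commonly used genome-wide significance threshold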

Functional interpretation of Genomic Variability

The accumulation of information on variant-condition associations allows us to have a concise view of the pathological states that are associated with genome variation and, more importantly, of the way in which this happens.

We already knew that some diseases, such as the thalassemias or Cystic Fibrosis, are associated with only one locus. We call the variants underlying such diseases highly penetrant, meaning that carrying the variant is almost certain to lead to the disease. On the other side of the spectrum, low-penetrance variants are only mildly associated with a condition (that is, with an OR close to 1). Penetrance is also related to the frequency with which a variant occurs in the population. A variant may be frequent if it has small penetrance, because this allows its spread in the population. [But in some cases we can have high frequency with high penetrance. Can you think why this happens?]

The relationship between the two is shown in the diagram below for some of the most well-studied genetic diseases.

A short story about bias. Intuitive Fisher’s test application, Cultural Cognition and the threats and responsibilities of being a scientist.

The most depressing brain finding ever…

In 2012 a group led by Dan M. Kahan of Yale Law School published this study. This work is an application of Kahan's Theory of Cultural Cognition, according to which there exists among people a tendency to perceive risks and related facts in relation to their personal values.

That paper was dubbed by many as the…

because of its key finding, which basically was that, in certain contexts, it is the smartest people that make the wrong decisions.

The main question Kahan et al. wanted to answer was: "Why does public conflict over societal risks persist in the face of compelling and widely accessible scientific evidence?" They set out to inquire whether, in the process of making judgements based on data, people take full advantage of their cognitive capacities or are constrained by other factors, which they described as "motivated cognition". In short, the question was: "In the face of difficult questions, do intelligent people do better?"

The experiment

They proceeded in the following way. They started with a large group of people and used a standardized numeracy test. Numeracy is [a measure of the ability and disposition to make use of quantitative information](https://www.numericalreasoningtest.org/#). They then split the group in two and gave the following question to the first half: A group of people uses a new skin cream to treat a rash and reports whether it got better or worse. An independent control group of people with rashes reports the same after the same period, without applying any treatment. The data are shown in the table below.

Take a minute and think about it. What would your answer be, now that you have understood how Fisher's test works?

While many people in the lower numeracy percentiles went with the high number of 223 and reported that it is better to use the cream, the correct answer is actually the opposite. The ratio of people who got better vs. those who got worse is close to 2 for those using the cream, while the same ratio for those who didn't use it is close to 5.
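If you want to check this with the machinery we saw earlier, here is a sketch; note that, apart from the 223 mentioned above, the counts are placeholders chosen only to reproduce the ratios described in the text (roughly 2 vs 5), not the exact figures shown to the subjects:

rash<-rbind(cream=c(better=223, worse=107),      # placeholder counts except for 223
            no.cream=c(better=75, worse=15))
rash[,"better"]/rash[,"worse"]    # improvement ratio per group: about 2 vs 5
fisher.test(rash)                 # OR < 1: the cream group does relatively worse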

In order to remove biases for positive or negative inclinations (i.e. some people are more prone to saying “yes” or “no” spontaneously), they presented the same data with the opposite outcomes.

As expected, subjects highest in numeracy did substantially better than less numerate ones when the data were presented as results from a study of a new skin rash treatment.

All well and good up to this point.

But then the researchers presented the same data in the contingency table to describe a different question, this time one that is loaded with a certain political context. They presented a separate, independent group of people with the fictional scenario of cities imposing restrictions on carrying concealed handguns and the outcome of crime rates compared to control cases where such restrictions were not in place. They used identical numbers as in the skin cream question.

And then this happened.
Compare the two plots. In the top panel, a "benign" question leads to the expected increase in correct answers as numeracy goes up. But in the bottom panel you see very little effect of numeracy on getting the correct answer.

This is actually explained by the preexisting political biases that come into play when attempting to reason with data related to a politically loaded question. Below you see the same responses split into subgroups depending on political affiliation. Left-leaning people perform according to their numeracy only when the data support their pre-existing inclination to think that constraints on handguns reduce crime rates. Right-leaning people do the same when the data support the opposite.

What is even more disturbing is that the performance of both groups, when faced with data that go against their pre-existing thesis, actually gets worse as their numeracy increases.

This is shown in the graph below, in which the researchers have measured the loss of accuracy depending on the numeracy of the subjects. People with high numeracy actually do worse. Smarter people are less likely to perform according to their numeracy when faced with politically loaded questions. Instead, they tend to stick to their pre-existing notions, in something that resembles prejudice or belief rather than actual reasoning.

Kahan et al. called this “Motivated Numeracy”.

Motivated Cognition

You can understand this better when looking at the graph below. Left-leaning Democrats and right-leaning Republicans have opposing views on gun control. But these opposing views become much more pronounced among people who are in general better at dealing with numbers.

This is not an observation that is limited to gun control. It extends to many other issues related to the environment, vaccines, tax reforms etc.

And it does not only apply to numeracy but to overall intelligence, measured through IQ.

Or to the capacity for analytical thinking, as shown by this Cognitive Reflection Test score (i.e. how well people use "slow" thinking instead of jumping to conclusions).

The authors explain this through their theory of Motivated Cognition. According to this, there is an inherent tendency for people to protect their identity. In order to stick with the pre-existing perceptions of their cultural group and to avoid cultural conflict, they effectively disable their cognitive faculties. They unconsciously opt for the wrong decision in order not to disrupt the cohesion of the cultural, political or other type of group they identify with.

Depressing? It surely is something to reflect upon even from the scientific point of view. To what extent do we fix our decision making to pre-existing biases even in the face of data that say otherwise? Are there identity groups in the scientific world, in the form of people that support one or another theory?

Scientific Curiosity

In 2017, five years after the original "most depressing study", the same group was analysing how likely people were to engage with scientific documentaries. They used a similar standardized scoring scheme but, instead of numeracy, they now scored people according to their scientific curiosity. This was based on how likely people were to engage with scientific findings, documentaries, lay articles etc.

With this information in hand, they turned back to their previous study and measured how strongly cultural cognition shows up as a function not of numeracy or IQ but of scientific curiosity. Viewed this way, the cultural and political biases were gone. People still have some politically-dependent predispositions, but their ability to reason increases with increasing scientific curiosity.

This is associated with a true open-mindedness. People with high scientific curiosity will be almost twice as likely to access a news story or information that goes against their political predisposition. And this is true for both left- and right-leaning individuals.

A positive and optimistic conclusion from this is that, having a genuine scientific curiosity, wanting to know how nature and things work and what wonders may still lie hidden in the cosmos, is not only the real open-mindedness but also the best way to use one’s intellectual capacities in an unbiased and constructive manner.

Now, get out there and be curious!

Assignment

For this week’s assignment you are asked to perform a motif building and search analysis using a set of defined motif instances and custom-made R functions.

Your goal is to create a PWM from a set of instances of the GATA transcription factor binding site, transform it into a PSSM and then search it against a whole-genome sequence.

  1. You may start by downloading a set of GATA binding sequences that you will find here.

  2. Once this is done, you can download the target sequence against which you will search the PWM here.

  3. In order to build the PWM you need to read the GATA sequences using this custom function and then create the matrix with the PWM function. In order to use these functions you have to load them into R using:

source("nameoffunction.R")

This means that you may use \(readfastafile()\) as:

source("readfastafile.R")
gataseqs<-readfastafile("gata.fa")

and then obtain the PWM with:

source("seqMotif.R")
gataPWM<-seqMotif(gataseqs, drawmotif=F)

  4. The transformation of the PWM to a PSSM requires an additional random sequence collection, which you can find here. You can then create the PSSM by taking the log2 ratio of the two PWMs (the one built from the GATA sequences over the one built from the random sequences), as sketched below.
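A minimal sketch of this step, assuming the random sequences were saved under the (hypothetical) file name "random.fa" and that seqMotif() returns PWMs of identical dimensions so that they can be divided element-wise:

randomseqs<-readfastafile("random.fa")        # "random.fa" is a placeholder name
randomPWM<-seqMotif(randomseqs, drawmotif=F)  # background PWM from the random sequences
pssm<-log2(gataPWM/randomPWM)                 # PSSM as the log2 ratio of the two PWMs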

  5. In the next step you will use the pssmSearch function to search for the motif in the longer target sequence. Assuming you have created the PSSM in a table called \(pssm\), the list of commands below should work:

source("pssmSearch.R")
targetseq<-readfastafile("test1.fa")
pssmSearch(pssm, targetseq, threshold=X)

The parameter X passed as the threshold controls how many successful hits the search function returns. The threshold can take values from 0 to 1, with 1 returning only those instances of binding sites whose score equals the maximum (100%) of the PSSM. A value of 0.8 will return instances with scores of 80% or more of the maximum, and thus a greater number of hits.

  6. As a last step you are asked to do the following. Perform three different searches with threshold = 0.5, 0.8 and 1.0 and obtain the resulting positions in a vector for each. Use each vector of positions to extract the corresponding sequences from the target sequence (test1.fa), store these sequences in a table (you can use "write.table") and paste them into the WebLogo web service to obtain three different sequence logos. Compare the three logos and discuss their differences. A rough sketch of the extraction step is given below.
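The sketch below assumes that pssmSearch() returns a vector of start positions, that readfastafile() returns the target as a single character string and that the PSSM has one column per motif position; adapt it to the actual output of the provided functions:

hits<-pssmSearch(pssm, targetseq, threshold=0.8)   # repeat for 0.5 and 1.0
motif.length<-ncol(pssm)                           # assumes one PSSM column per position
hit.seqs<-sapply(hits, function(i) substr(targetseq, i, i+motif.length-1))
write.table(hit.seqs, "gata_hits_080.txt", quote=FALSE, row.names=FALSE, col.names=FALSE)
# paste the contents of each file into the WebLogo web service to get the logos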