2024-01-27

Hypothesis Testing

Whether designing an experiment to collect data or investigating trends in existing data, researchers frame a question to guide their efforts, such as:

  • “Is the rate of inheritance for two genes different from expected if they were unlinked?”
  • “Is cancer incidence in mammals related to their litter size?”

Often, collecting data from the entirety of a population is infeasible. We are able to draw conclusions about the population by looking at a well-selected representative sample.

Hypothesis testing lets us assess whether a pattern observed in a sample is strong enough to support a conclusion about the population, and how strong that evidence is.

Hypothesis Testing

H0 Null Hypothesis

The null hypothesis represents the case in which the sample data do not differ from what the population or model predicts, i.e., (informally) nothing interesting is going on.

Ha Alternate Hypothesis

The alternate hypothesis represents the claim we are looking for evidence of; that is, the presence of a trend. It directly contradicts the null hypothesis.


It’s important to note that we cannot definitively accept the null hypothesis; instead, we write that we “fail to reject the null hypothesis,” as a lack of evidence does not necessarily indicate that a trend does not exist.

Chi-Square Tests

We can evaluate our hypotheses statistically with a Chi-Square (χ²) Test. Two commonly used types of this test are:

Chi-Square Goodness of Fit Test

Tests if the given data is well-represented by a given model.

Chi-Square Test of Independence

Tests if there is a significant association between two categorical variables.

We will be demonstrating the single-variable Goodness of Fit test.

The Goodness of Fit Test

The formula for a Goodness of Fit Test is as follows:

\(χ^2=\sum\frac{(o_i-e_i)^2}{e_i}\), where:

Observed Value (\(o_i\))

The experimental/observed data you wish to test.

Expected Value (\(e_i\))

Expected data from the distribution model you wish to test your sample data against.
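The formula above can be evaluated by hand in a few lines of R. The sketch below uses hypothetical coin-flip counts (60 heads and 40 tails, not data from this presentation) tested against a fair-coin model; the variable names are illustrative only.

```r
# Hypothetical observed counts from 100 coin flips (illustrative data only)
observed <- c(heads = 60, tails = 40)

# Expected counts under a fair-coin model: total flips times each probability
expected <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum over categories of (o_i - e_i)^2 / e_i
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq
```

Each category contributes (10)^2 / 50 = 2, so the statistic for this toy example is 4.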

The Goodness of Fit Test (cont.)

Test statistic (χ²)

The value which is compared against a known distribution of critical values to determine if the results are significant.

  • If χ² > critical value, the results are significant, and we reject the null hypothesis
  • If χ² < critical value, the results are not significant, and we fail to reject the null hypothesis

The Goodness of Fit Test (cont.)

Degrees of Freedom (df)

The number of categories that are free to vary.

  • Equal to the number of categories (k) minus one.
  • e.g., if your possible outcomes are heads or tails, your df = (k-1) = (2-1) = 1.

Alpha (α)

The significance level of the test: the probability you are willing to accept of rejecting the null hypothesis when it is actually true. Together with the degrees of freedom, it determines which critical value the test statistic is compared against.

  • e.g., α = 0.01 means that you accept a 1% chance of rejecting a null hypothesis that is in fact true.
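As a rough illustration of how degrees of freedom and α together pick out a critical value, the base R quantile function qchisq() can be used. The test statistic of 4 below is a hypothetical value for a two-category (heads/tails) example, not a result from this presentation.

```r
k  <- 2        # number of categories (heads, tails)
df <- k - 1    # degrees of freedom

# Critical values: the (1 - alpha) quantile of the chi-square distribution
critical_05 <- qchisq(1 - 0.05, df = df)  # about 3.84
critical_01 <- qchisq(1 - 0.01, df = df)  # about 6.63

test_statistic <- 4  # hypothetical value, for illustration only

# Decision rule: reject the null hypothesis when the statistic exceeds the critical value
reject_at_05 <- test_statistic > critical_05
reject_at_01 <- test_statistic > critical_01
c(alpha_0.05 = reject_at_05, alpha_0.01 = reject_at_01)
```

With these numbers the null hypothesis would be rejected at α = 0.05 but not at the stricter α = 0.01, which shows why α should be chosen before running the test.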

Application in Dihybrid Crosses

As a real-world example, let’s consider our first hypothesis,

“Is the rate of inheritance for two genes different from expected if they were unlinked?”

This is a great question for a Goodness of Fit test because we have a well-substantiated model by which to compare single-variable sample data.

Application in Dihybrid Crosses (cont.)

Mendelian inheritance predicts a 9:3:3:1 ratio of offspring phenotypes in a dihybrid cross for two genes expressing complete dominance. If a cross is performed and the offspring express a different phenotypic ratio, these results could be due to chance or due to gene linkage.


Phenotype: An observable characteristic of an organism, determined by the alleles it carries for a gene.

Dihybrid Cross: A cross performed between two parents which are both heterozygous (BbEe) for two genes.

Complete Dominance: The presence of the dominant allele (B) will fully mask the effects of the recessive allele (b) on phenotype.

Gene Linkage: Genes closer together on the same chromosome may have a higher likelihood of being inherited together.

Application in Dihybrid Crosses (cont.)

Imagine we cross two heterozygous fruit flies (BbEe), considering two genes:

  • The body color gene, where B codes for brown bodies and b codes for black bodies.
  • The eye color gene, where E codes for red eyes and e codes for brown eyes.

We would expect to see a cross as follows:

The 9:3:3:1 offspring phenotype ratio expected by laws of independent assortment and Mendelian genetics. (Some Genes Are Transmitted to Offspring in Groups via the Phenomenon of Gene Linkage, 2014)

Application in Dihybrid Crosses (cont.)

For an experimental cross of heterozygous, brown-bodied, red-eyed (BbEe) fruit flies, the following offspring distribution was obtained:

Phenotype                 Observed
Brown body, Red eyes          1284
Brown body, Brown eyes         355
Black body, Red eyes           444
Black body, Brown eyes          98


We would like to investigate whether the genes for body color and eye color are linked.
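The R code in this presentation refers to a data frame named flycounts, but its construction is not shown. A minimal sketch, assuming it simply holds the table above with the columns Phenotype and Observed, might look like this:

```r
# Assumed construction of flycounts (not shown in the original slides)
flycounts <- data.frame(
  Phenotype = c("Brown body, Red eyes", "Brown body, Brown eyes",
                "Black body, Red eyes", "Black body, Brown eyes"),
  Observed  = c(1284, 355, 444, 98)
)

# Total number of offspring counted
total_offspring <- sum(flycounts$Observed)
total_offspring
```

The total of 2181 offspring is what the 9:3:3:1 model is later scaled by to obtain expected counts.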

Application in Dihybrid Crosses (cont.)

Null Hypothesis

The distribution of fly offspring will follow a 9:3:3:1 ratio.

Alternate Hypothesis

The distribution of fly offspring will not follow a 9:3:3:1 ratio.

Application in Dihybrid Crosses (cont.)

Perform Goodness of Fit Test

We can use the R function chisq.test() to test our hypothesis. For a goodness of fit test, chisq.test(x, p) takes two arguments: a numeric vector of observed counts x and a vector of expected probabilities p, where length(x) == length(p) and the probabilities sum to 1.

probabilities <- c(9/16, 3/16, 3/16, 1/16)
flychi <- chisq.test(flycounts$Observed, p=probabilities)
flychi
## 
##  Chi-squared test for given probabilities
## 
## data:  flycounts$Observed
## X-squared = 23.554, df = 3, p-value = 3.094e-05
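To connect this output back to the Goodness of Fit formula, the statistic and p-value can also be computed by hand. This is only a cross-check sketch using base R; the variable names are illustrative, and chisq.test() remains the simpler route.

```r
# Observed offspring counts from the table and the 9:3:3:1 model
observed      <- c(1284, 355, 444, 98)
probabilities <- c(9, 3, 3, 1) / 16

# Expected counts: total offspring scaled by each model probability
expected <- sum(observed) * probabilities

# Chi-square statistic and degrees of freedom (k - 1 categories)
chi_sq <- sum((observed - expected)^2 / expected)
df     <- length(observed) - 1

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(chi_sq, 3)
p_value
```

The hand-computed statistic should agree with the X-squared value reported by chisq.test() above.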

Application in Dihybrid Crosses (cont.)

Before we interpret the significance of these results, apply the expected model to the total offspring count.

library(knitr)  # provides kable()
flycounts$Expected <- round(flychi$expected, 0)
kable(flycounts)

Phenotype                 Observed  Expected
Brown body, Red eyes          1284      1227
Brown body, Brown eyes         355       409
Black body, Red eyes           444       409
Black body, Brown eyes          98       136

Application in Dihybrid Crosses (cont.)

Here is a barplot of the observed vs. expected data.

The observed values are visually different from the expected values, but is this significant enough to suggest linkage?
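The code behind the barplot is not shown in the slides. One possible way to draw a comparable grouped barplot, assuming base R graphics rather than whatever plotting package was originally used, is sketched below; the shortened column labels are illustrative.

```r
# Observed and (rounded) expected counts, one column per phenotype
counts <- rbind(
  Observed = c(1284, 355, 444, 98),
  Expected = c(1227, 409, 409, 136)
)
colnames(counts) <- c("Brown/Red", "Brown/Brown", "Black/Red", "Black/Brown")

# Grouped bars: observed next to expected for each phenotype
barplot(
  counts,
  beside      = TRUE,
  legend.text = TRUE,
  xlab        = "Phenotype (body color / eye color)",
  ylab        = "Number of offspring"
)
```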

Application in Dihybrid Crosses (cont.)

The package gginference contains a function that allows us to plot the Chi-Square distribution for our test.

Application in Dihybrid Crosses (cont.)

\[23.554 > 7.815\]

Our test statistic is larger than the critical value for a Chi-Square distribution with df = 3 at α = 0.05, and we are therefore able to reject the null hypothesis; this offspring distribution differs from the 9:3:3:1 ratio expected of a typical dihybrid cross. If the null hypothesis were true, a deviation this large would arise by chance less than 5% of the time (here, p = 3.094e-05).
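The critical value of 7.815 used in this comparison can be reproduced with the base R quantile function qchisq(). The sketch assumes α = 0.05, which is the level that corresponds to that critical value at df = 3.

```r
alpha <- 0.05  # assumed significance level
df    <- 3     # four phenotype categories minus one

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)

# Compare the test statistic reported by chisq.test() with the critical value
test_statistic <- 23.554
reject_null <- test_statistic > critical_value
reject_null
```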

The distribution we obtained is consistent with the genes for body color and eye color being linked, and it also hints at potential allelic arrangements on the parental chromosomes.

References