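Let’s review by performing Gregor Mendel’s chi-square test in R. One package that helps is MASS, although the chisq.test() function used below is part of the stats package that loads with base R. First, install the package, and then load it with library().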
After running the library, we need to tell R what the observed and expected values are.
First, let’s store all of this in R objects. In Mendel’s experiment, he observed 416 yellow peas and 140 green peas. The expected outcome, based on what we now call Mendelian genetics, is 3:1, or 0.75 to 0.25. To run a chi-square test, we need to put these values into objects. For observed, I put in the observed pea counts. For expected, I put in the expected proportions, which must sum to 1.
Let’s assign these to the following object names.
observed <- c(416, 140)
expected <- c(.75, .25)
Let’s make some statistical hypotheses.
H0: If crosses follow Mendelian laws, the observations (number of each type of F2) in each category (Yellow/Green) are equal to the expected (3:1 phenotypic ratios).
HA: Crosses do not follow Mendelian laws, and the observations (number of each type of F2) in each category (Yellow/Green) will deviate from the expected (3:1 phenotypic ratios).
Notice that the data show the following ratios.
# The fraction of Yellow peas is
416/(416+140)
## [1] 0.7482014
# The fraction of Green peas is
140/(416+140)
## [1] 0.2517986
The values are very close to the expected fractions, so we probably don’t expect a statistical test to suggest that we reject our null hypothesis. But we should always run the statistics!
OK. Now we are ready to run the test. For this, there is a chisq.test() function. I’ve added a parameter (correct = FALSE) because I don’t want R to apply a continuity correction at this point. (That correction only applies to 2 x 2 tables, so it has no effect on this test either way.)
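To make the comparison concrete, here is a minimal sketch that computes the observed fractions and the expected counts side by side. The names observed_fraction and expected_counts are introduced here only for illustration; they are not part of the original worksheet.

```r
# Observed counts and expected proportions from Mendel's color experiment
observed <- c(Yellow = 416, Green = 140)
expected <- c(Yellow = 0.75, Green = 0.25)

# Fraction of each color actually observed
observed_fraction <- prop.table(observed)
observed_fraction

# Number of peas of each color we would expect under a 3:1 ratio
expected_counts <- sum(observed) * expected
expected_counts
```

The chi-square test compares observed with expected_counts, not with the proportions directly.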
?chisq.test
chisq.test(x=observed, p=expected, correct = FALSE)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 0.0095923, df = 1, p-value = 0.922
Let’s also review the terms. R reports a chi-square value (X-squared), the degrees of freedom (df), and a p-value.
The chi-square value is the test statistic: it measures how far the observed counts fall from the expected counts. Usually, higher chi-square values make it more likely that you can reject the null hypothesis. However, this depends on your degrees of freedom (df). The df is based on the number of categories in a variable; in this case, Mendel classified the color variable as either yellow or green. Thus, yellow and green are the two possible results, or options, and the df is n - 1 (options - 1). So, the df = 1.
Then you get a p-value. This is the probability of getting the observed result, or a more extreme result, if the null hypothesis is true.
So for this case, our p-value is 0.922, which is really high. Thus, we would say that our data are consistent with our null hypothesis, and we fail to reject it.
Looks like Mendel’s experiment follows his law almost exactly! Are the data too good? Well, Mendel took good notes and had a robust sample size. The numbers allowed him to think pretty clearly about the rules of inheritance.
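The three numbers R reports can also be reproduced by hand. The sketch below assumes the usual Pearson formula, the sum of (observed - expected)^2 / expected over the categories; the variable names chi_sq, df, and p_value are chosen here for illustration.

```r
observed <- c(416, 140)
expected <- c(0.75, 0.25)

# Convert expected proportions to expected counts
expected_counts <- sum(observed) * expected

# Pearson chi-square statistic: sum of (O - E)^2 / E
chi_sq <- sum((observed - expected_counts)^2 / expected_counts)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# p-value: probability of a statistic this large or larger under H0
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

These values should agree with the X-squared, df, and p-value printed by chisq.test() above.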
What if we repeated Mendel’s experiments, but only counted a small number of peas? Let’s say we got these results.
Yellow: 33 (0.89)
Green: 4 (0.11)
At first glance, this looks like it could differ from the 0.75:0.25 ratio.
observed2 <- c(33, 4)
expected <- c(.75, .25)
chisq.test(x=observed2, p=expected, correct = FALSE)
##
## Chi-squared test for given probabilities
##
## data: observed2
## X-squared = 3.973, df = 1, p-value = 0.04624
Using the standard chi-square test suggests that we should reject the null hypothesis. However, the sample size is small, so we might want to reconsider our data before making big claims about a new model of genetic inheritance. To do this, we can estimate the p-value by simulating data under the null hypothesis, an approach called Monte Carlo simulation. In R, we can run a simulated chi-square test. This is how that is done.
# Monte Carlo simulated p-value
chisq.test(x=observed2, p=expected, simulate.p.value = TRUE)
##
## Chi-squared test for given probabilities with simulated p-value (based
## on 2000 replicates)
##
## data: observed2
## X-squared = 3.973, df = NA, p-value = 0.04548
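The simulated p-value (about 0.045) is close to the original one and still sits just under 0.05, so our result is borderline. Because it is based on random replicates, the simulated p-value will also change slightly from run to run. Maybe we should replicate our experiment before publishing our big paper!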
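To see what the simulation is doing, here is a rough sketch of the same idea written out by hand. It is an illustration, not R's internal code: it draws 2000 samples of 37 peas from the 3:1 model with rmultinom(), computes the chi-square statistic for each, and counts how often the simulated statistic is at least as large as the observed one. The seed value and the helper chi_stat() are assumptions made for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

observed2 <- c(33, 4)
expected <- c(0.75, 0.25)
n <- sum(observed2)
expected_counts <- n * expected

# Helper (hypothetical name): chi-square statistic for one set of counts
chi_stat <- function(counts) {
  sum((counts - expected_counts)^2 / expected_counts)
}

observed_stat <- chi_stat(observed2)

# Simulate B experiments of n peas each under the null hypothesis
B <- 2000
sims <- rmultinom(B, size = n, prob = expected)  # 2 x B matrix of counts
sim_stats <- apply(sims, 2, chi_stat)

# Proportion of simulated statistics at least as extreme as the observed one
p_sim <- (1 + sum(sim_stats >= observed_stat)) / (B + 1)
p_sim
```

Changing the seed changes p_sim slightly, which is why a borderline result can land on either side of 0.05.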
We haven’t graphed this data yet! So far, we have talked about box plots and scatterplots. How about bar graphs?
To start, let’s make Mendel’s Yellow/Green pea plant experiment into a data frame.
mendel <- data.frame(observed, expected)
mendel
## observed expected
## 1 416 0.75
## 2 140 0.25
Now, let’s assign row names.
rownames(mendel) <- c("Yellow", "Green")
Now, let’s do a simple barplot.
barplot(mendel$observed)
?barplot
Now, let’s clean up this barplot by adding bar labels, a legend, an axis title, and colors.
barplot(mendel$observed, names.arg = rownames(mendel), legend.text = rownames(mendel), xlab = "Pea color", col = c("yellow", "green"))
Great! You made a barplot. We will be making much more advanced plots in the future, but this is a good start! Originally, we were going to do ggplots earlier, but we will save that for later!
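As an optional extension, the sketch below plots observed and expected counts next to each other, which is where the beside argument of barplot() matters. The matrix name counts and the gray colors are choices made for this example.

```r
observed <- c(416, 140)
expected <- c(0.75, 0.25)

# One row per series, one column per pea color
counts <- rbind(Observed = observed,
                Expected = sum(observed) * expected)
colnames(counts) <- c("Yellow", "Green")
counts

# beside = TRUE draws the two series as side-by-side bars within each color
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "Pea color", ylab = "Number of peas",
        col = c("gray30", "gray80"))
```

With counts this close, the paired bars are nearly the same height, which matches the high p-value from the test.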
What if you have multiple phenotypes? For example, Gregor Mendel analyzed not only the color of peas, but also their shape. He described peas by color: Yellow (Y) and Green (y), as discussed above. But he also described them by shape: Round (R) and Wrinkled (r). Similar to color, the genetics of shape can also be dominant/recessive, such that Round peas are more likely than Wrinkled peas. He performed the cross YyRr x YyRr and examined the progeny. He developed the 9:3:3:1 rule, which states that 9/16 will be Yellow and Round, 3/16 will be Yellow and Wrinkled, 3/16 will be Green and Round, and 1/16 will be Green and Wrinkled. This became the basis of the law of independent assortment (i.e., that Color and Shape will follow their own rules).
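The 9:3:3:1 rule follows from multiplying two independent 3:1 ratios. The short sketch below shows this with outer(), which multiplies every color probability by every shape probability; the object names color, shape, and dihybrid are introduced only for this illustration.

```r
# Each trait on its own follows a 3:1 (0.75 : 0.25) ratio
color <- c(Yellow = 0.75, Green = 0.25)
shape <- c(Round = 0.75, Wrinkled = 0.25)

# If the traits assort independently, joint probabilities are products
dihybrid <- outer(color, shape)
dihybrid

# Expressed in sixteenths, this gives the 9:3:3:1 pattern
dihybrid * 16
```

The four proportions sum to 1, so they can be passed as the p argument of chisq.test().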
#### Challenge 1
Given the following findings from Mendel, are there significant differences between his observations and the expected ratios? Do a chi-square test, and report your chi-square value, df, and p-value. Then, make a bar chart with this data.
Yellow and Round: 315
Yellow and Wrinkled: 101
Green and Round: 108
Green and Wrinkled: 32
For some non-parametric studies, you might have multiple variables. See your text (Handbook of Biological Statistics) for examples. In class, we will make a sample experiment and use the chi-square test of independence to test our results. The chi-square test of independence is used to determine whether two nominal variables are likely to be related or not.
For example, let’s say the owner of a movie theater wants to estimate how many snacks to buy. If movie type and snack purchases are unrelated, estimating will be simpler than if the movie types impact snack sales. We have a list of movie genres; this is our first variable. Our second variable is whether or not the patrons of those genres bought snacks at the theater. Our idea (or, in statistical terms, our null hypothesis) is that the type of movie and whether or not people bought snacks are unrelated.
Let’s imagine the data from this table.
TypeofMovie <- c("Action", "Comedy", "Family", "Horror")
Snacks <- c(50, 125, 90, 45)
NoSnacks <- c(75, 175, 30, 10)
# row.names = 1 uses the first column (TypeofMovie) as row names,
# so only the numeric count columns are passed to the test
Movie <- data.frame(TypeofMovie, Snacks, NoSnacks, row.names = 1)
Movie
## Snacks NoSnacks
## Action 50 75
## Comedy 125 175
## Family 90 30
## Horror 45 10
#### Challenge 2
Using your worksheet, fill out the tables by hand.
chisq <- chisq.test(Movie)
chisq
##
## Pearson's Chi-squared test
##
## data: Movie
## X-squared = 65.012, df = 3, p-value = 4.987e-14
chisq$expected
## Snacks NoSnacks
## Action 64.58333 60.41667
## Comedy 155.00000 145.00000
## Family 62.00000 58.00000
## Horror 28.41667 26.58333
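The expected table can be rebuilt by hand, which may help when checking a worksheet. The sketch below assumes the standard rule for a test of independence: each expected count is (row total x column total) / grand total, and df is (rows - 1) x (columns - 1). The names counts, expected_counts, chi_sq, df, and p_value are chosen for this example.

```r
# Rebuild the movie data so this block runs on its own
Movie <- data.frame(
  Snacks   = c(50, 125, 90, 45),
  NoSnacks = c(75, 175, 30, 10),
  row.names = c("Action", "Comedy", "Family", "Horror")
)
counts <- as.matrix(Movie)

# Expected count per cell: row total * column total / grand total
expected_counts <- outer(rowSums(counts), colSums(counts)) / sum(counts)
expected_counts

# Pearson chi-square statistic, degrees of freedom, and p-value
chi_sq <- sum((counts - expected_counts)^2 / expected_counts)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

Comparing counts with expected_counts cell by cell shows which movie types sell more or fewer snacks than independence would predict.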
We can also visualize these results with a slightly different type of graph, called a balloon plot. In this graph, each count in the table is drawn as a circle, or balloon, whose size reflects the amount.
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# 1. Convert the data to a table
movietable <- as.table(as.matrix(Movie))
# 2. Graph
balloonplot(t(movietable), main ="movie snacks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Now, let’s make one for class and have you complete it. Poll the other students in your class. On the board, we will fill out the following information.
Name
Commuter (Y/N)
Athlete Status (Y/N)
Drinks coffee in the morning (Y/N)
Wakes up before 7am (Y/N)
Eats “healthy” breakfast (Y/N)
Wears Marian Swag more than 3 days a week (Y/N)
As a class, we will decide which data to run our stats on. Once we do, answer the following questions.
#### Challenge 3
What is the statistical null hypothesis?
Using R or by hand, run the chi-square test.
Make a balloon plot of your data.
MENDELIAN GENETICS, PROBABILITY, PEDIGREES, AND CHI-SQUARE STATISTICS
INTRODUCTION
Hemoglobin is a protein found in red blood cells (RBCs) that transports oxygen throughout the body. The hemoglobin protein consists of four polypeptide chains: two alpha chains and two beta chains. Sickle cell disease (also called sickle cell anemia) is caused by a genetic mutation in the DNA sequence that codes for the beta chain of the hemoglobin protein. The mutation causes an amino acid substitution, replacing glutamic acid with valine. Due to this change in amino acid sequence, the hemoglobin tends to precipitate (or clump together) within the RBC after releasing its oxygen. This clumping causes the RBC to assume an abnormal “sickled” shape.
Individuals who are homozygous for the normal hemoglobin allele (HBA) receive a normal hemoglobin allele from each parent and are designated AA. People who are homozygous for normal hemoglobin do not have any sickled RBCs. Individuals who receive one normal hemoglobin allele from one parent and one mutant hemoglobin, or sickle cell allele (HBS), from the other parent are heterozygous and are said to have sickle cell trait. Their genotype is AS. Heterozygous individuals produce both normal and mutant hemoglobin proteins. These individuals do not have sickle cell disease, and most of their RBCs are normal. However, due to having one copy of the sickle cell allele, these individuals do manifest some sickling of their RBCs in low-oxygen environments. People with sickle cell disease are homozygous for the sickle cell allele (SS genotype); they have received one copy of the mutant hemoglobin allele from each parent. The resulting abnormal, sickle-shaped RBCs in these people block blood flow in blood vessels, causing pain, serious infections, and organ damage.
1. Watch the short film The Making of the Fittest: Natural Selection in Humans, available here: https://www.biointeractive.org/classroom-resources/making-fittest-natural-selection-humans. While watching, pay close attention to the genetics of sickle cell trait and the connection to malaria infection.
#### Challenge 4
Answer the following questions regarding genetics, probability, pedigrees, and the chi-square statistical analysis test. Most can be answered on your worksheet.