Markdown Author: Jessie Bell, 2023
Libraries Used: none
Null and Alternative Hypotheses
The chi-squared goodness-of-fit test allows researchers to determine whether the frequencies observed in a sample are consistent with the expected proportions for the population (which are converted to expected frequencies). The statistical
hypotheses are stated as:
• \(H_0\): The proportion of each group in the sampled population equals the expected proportion for that group
• \(H_A\): The proportion of at least one group in the sampled population does not equal the expected proportion for that group
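As a quick illustration with made-up numbers (not the exercise data below), converting expected proportions to expected frequencies is just multiplication by the sample size:
# Hypothetical example: 100 observations and expected proportions of 0.5, 0.3, 0.2
toy.props <- c(A = 0.5, B = 0.3, C = 0.2)
toy.n <- 100
toy.props * toy.n  # expected frequencies: 50, 30, 20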
Make a frequency table of a categorical variable. Make up the data yourself. Share your table and the R code you used to make it in your write-up.
color.data <- c("red", "blue", "red", "red", "orange", "blue", "red", "orange")
(color.data2 <- c(red = 4, blue = 2, orange = 2))
## red blue orange
## 4 2 2
# Counts can also be stored in a data frame (different made-up colors here)
color.df <- data.frame(count = c(cyan = 4, purple = 2, orange = 10))
color.data3 <- table(color.data)
color.data3
## color.data
## blue orange red
## 2 2 4
# Sea star data
obs.freqs <- c(purple = 154, orange = 54)
exp.props <- c(purple = 0.75, orange = 0.25)
exp.freqs <- sum(obs.freqs) * exp.props
chisq.val <- sum(((obs.freqs - exp.freqs)^2) / exp.freqs)
# Displaying the results
exp.freqs
## purple orange
## 156 52
chisq.val
## [1] 0.1025641
# Plotting chi-squared distributions
curve(dchisq(x, df = 4), 0, 20, ylab = "Density")
curve(dchisq(x, df = 5), add = TRUE, col = "steelblue")
curve(dchisq(x, df = 6), add = TRUE, col = "steelblue1")
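To tell the three curves apart, a legend could be added; this is a minimal sketch that simply labels the colors used in the curve() calls above:
# Label the three density curves drawn above
legend("topright", legend = c("df = 4", "df = 5", "df = 6"),
       col = c("black", "steelblue", "steelblue1"), lty = 1, bty = "n")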
Share your work for the hypothesis test. Include the null and alternative hypotheses, the test statistic, the critical value, and a comparison between them.
# Manually calculate the critical value
qchisq(0.95, df = 1)
## [1] 3.841459
#Remember you want the right tail!
1- pchisq(chisq.val, df = 1)
## [1] 0.748774
# Running chi-squared test
result <- chisq.test(obs.freqs, p = exp.props, correct = FALSE)
# Displaying the results
result
##
## Chi-squared test for given probabilities
##
## data: obs.freqs
## X-squared = 0.10256, df = 1, p-value = 0.7488
#p-value = 0.7488 which checks out!
#df = 1
curve(dchisq(x, df = 1), 0, 20, ylab = "Density")
#Add chi-squared value to graph
abline(v = chisq.val, col = "firebrick4")
text(chisq.val, 0.1, "Test stat", pos = 4, col = "firebrick4")
#Add critical value to graph
abline(v = qchisq(0.95, df = 1), col = "steelblue")
text(qchisq(0.95, df = 1), 0.8, "Critical value", pos = 4, col = "steelblue")
xVals <- barplot(obs.freqs, ylim = c(0, 220), ylab = "Frequency",
                 col = c("darkorchid4", "darkorange"))
#xVals now holds the center of each bar
xVals
## [,1]
## [1,] 0.7
## [2,] 1.9
#I now play around with the values to get the lines right
lines(c(xVals[1]-0.6, xVals[1]+0.6), rep(exp.freqs[1], 2), lty = 5)
lines(c(xVals[2]-0.6, xVals[2]+0.6), rep(exp.freqs[2], 2), lty = 5)
text(x = 1, y = 200, labels = "--- expected frequencies")
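The text() call above is a makeshift key placed by hand; as an alternative sketch, legend() positions the key for you and can show the dashed line style directly:
# An alternative key for the dashed expected-frequency lines
legend("topright", legend = "expected frequencies", lty = 5, bty = "n")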
Null and Alternative Hypotheses for the chi-squared test of independence:
• \(H_0\): The two variables are independent
• \(H_A\): The two variables are not independent
herbivore <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
"No", "No", "Yes")
phenolics <- c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",
"Yes", "No", "Yes")
#You can combine the data into a data.frame
(myData <- data.frame(herbivore, phenolics))
## herbivore phenolics
## 1 Yes No
## 2 Yes Yes
## 3 No No
## 4 Yes No
## 5 Yes Yes
## 6 Yes Yes
## 7 No No
## 8 Yes Yes
## 9 No No
## 10 No Yes
## 11 No No
## 12 Yes Yes
write.csv(myData, "inducible_defense_data.csv")
(conTable <- table(myData))
## phenolics
## herbivore No Yes
## No 4 1
## Yes 2 5
(conTable2 <- matrix(c(4,1,2,5), 2, byrow = T, dimnames = list(herbivores =
c("No", "Yes"), phenolics = c("No", "Yes"))))
## phenolics
## herbivores No Yes
## No 4 1
## Yes 2 5
Color polymorphism in sea stars
Let’s continue to analyze our sea star data from Harley et al. (2006). They measured the frequencies of the two common color morphs at many sites, not just Strawberry Hill. In fact, they sampled Hat Island, which is a small island just south of Bellingham at the mouth of Padilla Bay. At Hat Island they sampled 142 individuals, and all but one were purple! So, let’s ask whether color morph is independent of site. What is your guess given the data? Does it seem extreme? Let’s enter the data again manually:
#Hat Island data
HI <- c(purple = 141, orange = 1)
#Strawberry Hill data
SH <- c(purple = 154, orange = 54)
(obsFreqs <- matrix(c(HI, SH), 2,
dimnames = list(color = c("Purple", "Orange"), site = c("Hat Island",
"Strawberry Hill"))))
## site
## color Hat Island Strawberry Hill
## Purple 141 154
## Orange 1 54
Unlike the Chi-squared goodness-of-fit test, the Chi-squared test for independence requires that we calculate the expected values from the observed data. We do this with something called a contingency table. There is a video online that goes through this process with you by hand. The following exercise will use R functions to calculate the contingency table.
Expected Values
So let’s calculate the expected values from the contingency table. An easy way to calculate the row, column, and total sums is with the function addmargins(). You can then extract the row and column totals and calculate the row and column probabilities: pull out the correct values with square brackets and divide by the total number of observations. The function outer() will then calculate all the joint probabilities, and multiplying that matrix by the total number of observations gives the expected frequencies. For a small data set like this, doing it by hand might be quicker; for a large data set, the R functions will save you a lot of time. Remember that each expected value is easily calculated by multiplying the row total and column total of interest and then dividing by the grand total.
(obsFreqsMar <- addmargins(obsFreqs))
## site
## color Hat Island Strawberry Hill Sum
## Purple 141 154 295
## Orange 1 54 55
## Sum 142 208 350
#Order matters, do rows and then columns for this example
(expFreqs <- outer(obsFreqsMar[1:2,3], obsFreqsMar[3,1:2])/obsFreqsMar[3,3])
## Hat Island Strawberry Hill
## Purple 119.68571 175.31429
## Orange 22.31429 32.68571
#Recommend double checking your answers
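As one such double check, the row total times column total divided by grand total formula from the text can be applied to a single cell; it should match the corresponding entry of expFreqs:
# Expected frequency for Purple at Hat Island: row total * column total / grand total
295 * 142 / 350  # should match expFreqs["Purple", "Hat Island"] (about 119.69)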
We can now use the equation for the Chi-squared test statistic to calculate the Chi-squared value.
\(\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}(O_{ij}-E_{ij})^2/E_{ij}\)
where \(O\) are the observed values, \(E\) are the expected values, \(r\) is the number of rows, and \(c\) is the number of columns.
R makes this calculation very easy because element-wise operations pair up the values in corresponding cells. For example, when you add two matrices, R adds the value in the upper-left of one matrix to the value in the upper-left of the other matrix.
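Here is a tiny made-up example of that element-wise pairing:
# Element-wise matrix arithmetic pairs cells by position
A <- matrix(1:4, 2)
B <- matrix(c(1, 1, 1, 1), 2)
A - B  # each cell of B is subtracted from the matching cell of A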
We looked at the Chi-squared distribution on the Chi-squared goodness-of-fit test webpage; the test for independence uses exactly the same distribution. To calculate the degrees of freedom for the test of independence, multiply the number of rows minus 1 by the number of columns minus 1.
\(df = (r-1)*(c-1)\)
So, for our example, we have 2 rows and 2 columns, and thus the degrees of freedom are 1. We can now calculate the P-value for our test statistic (the Chi-squared value).
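The degrees of freedom can also be computed directly from the dimensions of the observed-frequency matrix; a quick sketch:
# df = (rows - 1) * (columns - 1); for this 2 x 2 table it equals 1
(nrow(obsFreqs) - 1) * (ncol(obsFreqs) - 1)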
(chiVal <- sum((obsFreqs - expFreqs)^2 / expFreqs))
## [1] 40.6452
# P-value: the area in the right tail beyond the observed Chi-squared value
pChi <- 1 - pchisq(chiVal, 1)
curve(dchisq(x, 1), 0, 60, ylab = "Density")
abline(v = c(chiVal, qchisq(0.95, 1)), col = "red")
text(c(chiVal, qchisq(0.95, 1)), 0.3, c("Chi-squared value", "Critical value"), col = "red")
Share your work for this hypothesis test.
\(H_0\): The proportion of each group in the sampled population equals the expected proportion for that group
\(H_A\): The proportion of at least one group in the sampled population does not equal the expected proportion for that group
# Calculating expected frequencies
exp.freqs <- sum(obs.freqs) * exp.props
# Calculating the chi-squared test statistic
chi_squared_values <- ((obs.freqs - exp.freqs)^2) / exp.freqs
chi_squared_statistic <- sum(chi_squared_values)
chi_squared_statistic
## [1] 0.1025641
# Degrees of freedom for the chi-squared test
df <- length(obs.freqs) - 1
# Significance level (alpha)
alpha <- 0.05
# Calculating the critical value
critical_value <- qchisq(1 - alpha, df)
critical_value
## [1] 3.841459
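To finish the comparison step, the test statistic can be checked against the critical value directly; because 0.103 is well below 3.84, the comparison is FALSE and we fail to reject the null hypothesis.
# Reject H0 only if the test statistic exceeds the critical value
chi_squared_statistic > critical_value  # FALSE here, so we fail to reject H0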
Share your work for this quick hypothesis test. How does the result compare with the “manual” version of running the test?
chisq.test(obsFreqs, correct = F)
##
## Pearson's Chi-squared test
##
## data: obsFreqs
## X-squared = 40.645, df = 1, p-value = 1.825e-10
The X-squared value matches the chiVal we computed by hand (40.6452), and the P-value matches 1 - pchisq(chiVal, 1), so the quick version of the test agrees with the manual one.
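The object returned by chisq.test() also stores the pieces we computed manually, which makes a quick cross-check easy; a short sketch (ind.test is just a name chosen here):
# Save the test result and pull out the expected counts and statistic
# to compare with expFreqs and chiVal from the manual calculation
ind.test <- chisq.test(obsFreqs, correct = FALSE)
ind.test$expected
ind.test$statistic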
Fisher’s Test
fisher.test(obsFreqs)
##
## Fisher's Exact Test for Count Data
##
## data: obsFreqs
## p-value = 7.659e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 8.215059 1982.702767
## sample estimates:
## odds ratio
## 49.08337
mosaicData <- obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T)
#Transpose the matrix so site is on the x axis
mosaicData2 <- t(mosaicData)
mosaicplot(mosaicData2,
main = "hi I am a title I",
col = c("darkorchid4", "orange"),
ylab = "Color",
xlab = "Site")
opar <- par(no.readonly = TRUE)  # save current graphics settings so they can be restored later
par(mar = c(4, 4, 1, 8) + 0.1)
barplot(obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T),
legend = T,
width = c(142, 208),
ylab = "Relative Frequency",
col = c("darkorchid4", "darkorange"),
args.legend = list(x = "topright", bty = "n", inset = c(-0.3, 0.4)),
main="hi I am a title II")
barplot(obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T),
beside = T,
legend = T,
ylim = c(0, 1),
ylab = "Relative Frequency",
col = c("darkorchid4", "darkorange"),
args.legend = list(x = "topright", bty = "n"),
main="hi I am a title III")