Markdown Author: Jessie Bell, 2023
Libraries Used: none
Null and Alternative Hypotheses
The chi-squared goodness-of-fit test allows researchers to determine whether the frequencies observed in a sample are consistent with the expected proportions for the population (which are converted to expected frequencies). The statistical
hypotheses are stated as:
• \(H_0\): The proportion of each group in the sampled population equals the expected proportion for that group
• \(H_A\): The proportion of at least one group in the sampled population does not equal the expected proportion for that group
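As a quick illustration with made-up numbers (not the exercise data below), converting expected proportions to expected frequencies is just multiplication by the sample size:
# Hypothetical example: 100 observations and expected proportions of 0.5, 0.3, 0.2
toy.props <- c(A = 0.5, B = 0.3, C = 0.2)
toy.n <- 100
toy.props * toy.n  # expected frequencies: 50, 30, 20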
Make a frequency table of a categorical variable. Make up the data yourself. Share your table and the R code you used to make it in your write-up.
color.data <- c("red", "blue", "red", "red", "orange", "blue", "red", "orange")
(color.data2 <- c(red = 4, blue = 2, orange = 2))
## red blue orange
## 4 2 2
# Counts can also be stored in a data frame (different made-up colors here)
color.df <- data.frame(count = c(cyan = 4, purple = 2, orange = 10))
color.data3 <- table(color.data)
color.data3
## color.data
## blue orange red
## 2 2 4
# Sea star data
obs.freqs <- c(purple = 154, orange = 54)
exp.props <- c(purple = 0.75, orange = 0.25)
exp.freqs <- sum(obs.freqs) * exp.props
chisq.val <- sum(((obs.freqs - exp.freqs)^2) / exp.freqs)
# Displaying the results
exp.freqs
## purple orange
## 156 52
chisq.val
## [1] 0.1025641
# Plotting chi-squared distributions
curve(dchisq(x, df = 4), 0, 20, ylab = "Density")
curve(dchisq(x, df = 5), add = TRUE, col = "steelblue")
curve(dchisq(x, df = 6), add = TRUE, col = "steelblue1")
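To tell the three curves apart, a legend could be added; this is a minimal sketch that simply labels the colors used in the curve() calls above:
# Label the three density curves drawn above
legend("topright", legend = c("df = 4", "df = 5", "df = 6"),
       col = c("black", "steelblue", "steelblue1"), lty = 1, bty = "n")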
Share your work for the hypothesis test. Include the null and alternative hypotheses, the test statistic, the critical value, and a comparison between them.
# Manually calculate the critical value
qchisq(0.95, df = 1)
## [1] 3.841459
#Remember you want the right tail!
1- pchisq(chisq.val, df = 1)
## [1] 0.748774
# Running chi-squared test
result <- chisq.test(obs.freqs, p = exp.props, correct = FALSE)
# Displaying the results
result
##
## Chi-squared test for given probabilities
##
## data: obs.freqs
## X-squared = 0.10256, df = 1, p-value = 0.7488
#p-value = 0.7488 which checks out!
#df = 1
curve(dchisq(x, df = 1), 0, 20, ylab = "Density")
#Add chi-squared value to graph
abline(v = chisq.val, col = "firebrick4")
text(chisq.val, 0.1, "Test stat", pos = 4, col = "firebrick4")
#Add critical value to graph
abline(v = qchisq(0.95, df = 1), col = "steelblue")
text(qchisq(0.95, df = 1), 0.8, "Critical value", pos = 4, col = "steelblue")
xVals <- barplot(obs.freqs, ylim = c(0, 220), ylab = "Frequency",
                 col = c("darkorchid4", "darkorange"))
#xVals now holds the center of each bar
xVals
## [,1]
## [1,] 0.7
## [2,] 1.9
#I now play around with the values to get the lines right
lines(c(xVals[1]-0.6, xVals[1]+0.6), rep(exp.freqs[1], 2), lty = 5)
lines(c(xVals[2]-0.6, xVals[2]+0.6), rep(exp.freqs[2], 2), lty = 5)
text(x = 1, y = 200, labels = "--- expected frequencies")
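The text() call above is a makeshift key placed by hand; as an alternative sketch, legend() positions the key for you and can show the dashed line style directly:
# An alternative key for the dashed expected-frequency lines
legend("topright", legend = "expected frequencies", lty = 5, bty = "n")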
Null and Alternative Hypotheses for the chi-squared test of independence:
• \(H_0\): The two variables are independent
• \(H_A\): The two variables are not independent
herbivore <- c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
"No", "No", "Yes")
phenolics <- c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No",
"Yes", "No", "Yes")
#You can combine the data into a data.frame
(myData <- data.frame(herbivore, phenolics))
## herbivore phenolics
## 1 Yes No
## 2 Yes Yes
## 3 No No
## 4 Yes No
## 5 Yes Yes
## 6 Yes Yes
## 7 No No
## 8 Yes Yes
## 9 No No
## 10 No Yes
## 11 No No
## 12 Yes Yes
write.csv(myData, "inducible_defense_data.csv")
(conTable <- table(myData))
## phenolics
## herbivore No Yes
## No 4 1
## Yes 2 5
(conTable2 <- matrix(c(4,1,2,5), 2, byrow = T, dimnames = list(herbivores =
c("No", "Yes"), phenolics = c("No", "Yes"))))
## phenolics
## herbivores No Yes
## No 4 1
## Yes 2 5
Color polymorphism in sea stars
Let’s continue to analyze our sea star data from Harley et al. (2006). They measured the frequencies of the two common color morphs at many sites, not just Strawberry Hill. In fact, they sampled Hat Island, which is a small island just south of Bellingham at the mouth of Padilla Bay. At Hat Island they sampled 142 individuals, and all but one were purple! So, let’s ask whether color morph is independent of site. What is your guess given the data? Does it seem extreme? Let’s enter the data again manually:
#Hat Island data
HI <- c(purple = 141, orange = 1)
#Strawberry Hill data
SH <- c(purple = 154, orange = 54)
(obsFreqs <- matrix(c(HI, SH), 2,
dimnames = list(color = c("Purple", "Orange"), site = c("Hat Island",
"Strawberry Hill"))))
## site
## color Hat Island Strawberry Hill
## Purple 141 154
## Orange 1 54
Unlike the Chi-squared goodness-of-fit test, the Chi-squared test for independence requires that we calculate the expected values from the observed data. We do this with something called a contingency table. There is a video online that goes through this process with you by hand. The following exercise will use R functions to calculate the contingency table.
Expected Values
So let’s calculate the expected values from the contingency table. An easy way to calculate the row, column, and total sums is with the function addmargins(). You can then extract the row and column totals and calculate the row and column probabilities: pull out the correct values with square brackets and divide by the total number of observations. The function outer() will then calculate all the joint probabilities, and multiplying that matrix by the total number of observations gives the expected frequencies. For a small data set like this, doing it by hand might be quicker; for a large data set, the R functions will save you a lot of time. Remember that each expected value is easily calculated by multiplying the row total and column total of interest and then dividing by the grand total.
(obsFreqsMar <- addmargins(obsFreqs))
## site
## color Hat Island Strawberry Hill Sum
## Purple 141 154 295
## Orange 1 54 55
## Sum 142 208 350
#Order matters, do rows and then columns for this example
(expFreqs <- outer(obsFreqsMar[1:2,3], obsFreqsMar[3,1:2])/obsFreqsMar[3,3])
## Hat Island Strawberry Hill
## Purple 119.68571 175.31429
## Orange 22.31429 32.68571
#Recommend double checking your answers
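As one such double check, the row total times column total divided by grand total formula from the text can be applied to a single cell; it should match the corresponding entry of expFreqs:
# Expected frequency for Purple at Hat Island: row total * column total / grand total
295 * 142 / 350  # should match expFreqs["Purple", "Hat Island"] (about 119.69)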
We can now use the equation for the Chi-squared test statistic to calculate the Chi-squared value.
\(\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}(O_{ij}-E_{ij})^2/E_{ij}\)
where \(O\) are the observed values, \(E\) are the expected values, \(r\) is the number of rows, and \(c\) is the number of columns.
R makes this calculation very easy because element-wise operations pair up the values in corresponding cells. For example, when you add two matrices, R adds the value in the upper-left of one matrix to the value in the upper-left of the other matrix.
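Here is a tiny made-up example of that element-wise pairing:
# Element-wise matrix arithmetic pairs cells by position
A <- matrix(1:4, 2)
B <- matrix(c(1, 1, 1, 1), 2)
A - B  # each cell of B is subtracted from the matching cell of A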
We looked at the Chi-squared distribution on the Chi-squared goodness-of-fit test webpage; the test for independence uses exactly the same distribution. To calculate the degrees of freedom for the test of independence, multiply the number of rows minus 1 by the number of columns minus 1.
\(df = (r-1)*(c-1)\)
So, for our example, we have 2 rows and 2 columns, and thus the degrees of freedom are 1. We can now calculate the P-value for our test statistic (the Chi-squared value).
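The degrees of freedom can also be computed directly from the dimensions of the observed-frequency matrix; a quick sketch:
# df = (rows - 1) * (columns - 1); for this 2 x 2 table it equals 1
(nrow(obsFreqs) - 1) * (ncol(obsFreqs) - 1)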
(chiVal <- sum((obsFreqs - expFreqs)^2 / expFreqs))
## [1] 40.6452
# P-value: the area in the right tail beyond the observed Chi-squared value
pChi <- 1 - pchisq(chiVal, 1)
curve(dchisq(x, 1), 0, 60, ylab = "Density")
abline(v = c(chiVal, qchisq(0.95, 1)), col = "red")
text(c(chiVal, qchisq(0.95, 1)), 0.3, c("Chi-squared value", "Critical value"), col = "red")
Share your work for this hypothesis test.
\(H_0\): The proportion of each group in the sampled population equals the expected proportion for that group
\(H_A\): The proportion of at least one group in the sampled population does not equal the expected proportion for that group
# Calculating expected frequencies
exp.freqs <- sum(obs.freqs) * exp.props
# Calculating the chi-squared test statistic
chi_squared_values <- ((obs.freqs - exp.freqs)^2) / exp.freqs
chi_squared_statistic <- sum(chi_squared_values)
chi_squared_statistic
## [1] 0.1025641
# Degrees of freedom for the chi-squared test
df <- length(obs.freqs) - 1
# Significance level (alpha)
alpha <- 0.05
# Calculating the critical value
critical_value <- qchisq(1 - alpha, df)
critical_value
## [1] 3.841459
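To finish the comparison step, the test statistic can be checked against the critical value directly; because 0.103 is well below 3.84, the comparison is FALSE and we fail to reject the null hypothesis.
# Reject H0 only if the test statistic exceeds the critical value
chi_squared_statistic > critical_value  # FALSE here, so we fail to reject H0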
Share your work for this quick hypothesis test. How does the result compare with the “manual” version of running the test?
chisq.test(obsFreqs, correct = F)
##
## Pearson's Chi-squared test
##
## data: obsFreqs
## X-squared = 40.645, df = 1, p-value = 1.825e-10
The X-squared value matches the chiVal we computed by hand (40.6452), and the P-value matches 1 - pchisq(chiVal, 1), so the quick version of the test agrees with the manual one.
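The object returned by chisq.test() also stores the pieces we computed manually, which makes a quick cross-check easy; a short sketch (ind.test is just a name chosen here):
# Save the test result and pull out the expected counts and statistic
# to compare with expFreqs and chiVal from the manual calculation
ind.test <- chisq.test(obsFreqs, correct = FALSE)
ind.test$expected
ind.test$statistic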
Fisher’s Test
fisher.test(obsFreqs)
##
## Fisher's Exact Test for Count Data
##
## data: obsFreqs
## p-value = 7.659e-13
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 8.215059 1982.702767
## sample estimates:
## odds ratio
## 49.08337
mosaicData <- obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T)
#Transpose the matrix so site is on the x axis
mosaicData2 <- t(mosaicData)
mosaicplot(mosaicData2,
main = "hi I am a title I",
col = c("darkorchid4", "orange"),
ylab = "Color",
xlab = "Site")
opar <- par(no.readonly = TRUE)  # save current graphics settings so they can be restored later
par(mar = c(4, 4, 1, 8) + 0.1)
barplot(obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T),
legend = T,
width = c(142, 208),
ylab = "Relative Frequency",
col = c("darkorchid4", "darkorange"),
args.legend = list(x = "topright", bty = "n", inset = c(-0.3, 0.4)),
main="hi I am a title II")
barplot(obsFreqs/matrix(c(142, 208, 142, 208), 2, byrow = T),
beside = T,
legend = T,
ylim = c(0, 1),
ylab = "Relative Frequency",
col = c("darkorchid4", "darkorange"),
args.legend = list(x = "topright", bty = "n"),
main="hi I am a title III")