2024-03-18

Chi Square Test for Independence

The Chi Square test of independence is commonly used in statistics to determine whether two categorical variables are independent of each other. It has many applications; a common one in biology is determining whether two species are associated.

Hypothesis Testing

  • Null hypothesis: There is no association between the variables. These variables are independent of each other.
  • Alternative hypothesis: There is an association, whether positive or negative, between the variables. These variables are not independent of each other.

Calculating Chi Square Test of Independence for Two-Species Association

  • Species presence/absence data are arranged into a 2x2 contingency table
  • Rows and columns give the presence or absence of each species, and each cell holds the number of sites with that combination
  • Example 1: Species A and Species B
##          spb
## spa       Present Absent
##   Present      19      5
##   Absent        4      7
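
A table like the one above can be produced in R with table(). The sketch below is only an illustration: the site-level vectors are hypothetical, reconstructed from the four counts so that the result matches the table, and the names spa and spb are taken from the printed output.

```r
# Hypothetical presence/absence records for 35 sites, rebuilt from the four
# cell counts (the original site-level data are not shown in the slides).
counts <- c(19, 5, 4, 7)
spa <- rep(c("Present", "Present", "Absent", "Absent"), times = counts)
spb <- rep(c("Present", "Absent", "Present", "Absent"), times = counts)

# Fix the level order so that "Present" is listed before "Absent".
spa <- factor(spa, levels = c("Present", "Absent"))
spb <- factor(spb, levels = c("Present", "Absent"))

# Cross-tabulate: rows are Species A, columns are Species B.
obs <- table(spa, spb)
print(obs)
```

With real data, spa and spb would be the recorded presence/absence columns, one entry per site.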

Expected Distribution Calculations

  • First, find the expected count for each cell by multiplying its column total by its row total and then dividing by the total number of observations
  • [1,1] = \(((19+4)*(19+5))/35 = 15.77\)
  • [1,2] = \(((5+7)*(19+5))/35 = 8.23\)
  • [2,1] = \(((19+4)*(4+7))/35 = 7.23\)
  • [2,2] = \(((5+7)*(4+7))/35 = 3.77\)
##       [,1] [,2]
## [1,] 15.77 8.23
## [2,]  7.23 3.77
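
The four expected counts can also be computed in one step. This is a minimal sketch, assuming the observed counts are entered by hand as a matrix called obs (a name chosen here for illustration); outer() multiplies every row total by every column total.

```r
# Observed counts: rows are Species A, columns are Species B.
obs <- matrix(c(19, 5,
                 4, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(spa = c("Present", "Absent"),
                              spb = c("Present", "Absent")))

# expected[i, j] = (row i total * column j total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(expected, 2))
```

Row i, column j of the result pairs with row i, column j of the observed table, which matters in the next step.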

Chi Residuals Calculations

  • Second, for each cell, square the difference between the observed and expected values and divide by the expected value. Add these terms to find the chi square (the results below use the unrounded expected values)
  • [1,1] = \((19-15.77)^2/15.77 = 0.661\)
  • [1,2] = \((5-8.23)^2/8.23 = 1.267\)
  • [2,1] = \((4-7.23)^2/7.23 = 1.442\)
  • [2,2] = \((7-3.77)^2/3.77 = 2.764\)
  • \(X^2 = 0.661+1.267+1.442+2.764 = 6.134\)

P-Value and Hypothesis Rejection

  • For a chi square value of 6.134 and one degree of freedom, \((2-1)*(2-1) = 1\) for a 2x2 table, the p-value is calculated to be 0.01326
  • 0.01326 < 0.05 so we reject the null hypothesis: the data show a significant association between Species A and Species B
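
The statistic and p-value can be reproduced from the observed counts. This sketch assumes the same hand-entered matrix obs (an illustrative name) and cross-checks the manual result against chisq.test(). Note that chisq.test() applies the Yates continuity correction to 2x2 tables by default, so correct = FALSE is needed to match the hand calculation.

```r
# Observed counts: rows are Species A, columns are Species B.
obs <- matrix(c(19, 5,
                 4, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(spa = c("Present", "Absent"),
                              spb = c("Present", "Absent")))

# Expected counts under independence.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Each cell's term is (observed - expected)^2 / expected; their sum is X^2.
contrib <- (obs - expected)^2 / expected
chi_sq <- sum(contrib)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(round(contrib, 3))
print(chi_sq)
print(p_value)

# Built-in test without the continuity correction. R warns here because one
# expected count is below 5; the warning is suppressed for this comparison.
builtin <- suppressWarnings(chisq.test(obs, correct = FALSE))
print(builtin$p.value)
```

The manual and built-in p-values should agree, at roughly 0.0133.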

Species Presence

Example 2: Canopy Cover and Wildflower Cover

  • In a forest, wildflowers grow on the forest floor. Because all plants need sunlight, it can be hypothesized that tree canopy cover is associated with wildflower cover on the forest floor. Using a dataset of wildflower cover and canopy cover at 45 sites, we can test whether there is an association between the two.
##          wildflower
## canopy    None Partial Full
##   None       1       7    6
##   Partial    5       8    4
##   Full      13       1    0
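
For readers following along in R, the table above can be entered directly from its counts. This is a sketch rather than the original data preparation: the site-level records are not shown, so the counts are typed in by hand, and the object name cw is borrowed from the test output in this example.

```r
# Canopy cover (rows) by wildflower cover (columns), counts of sites.
cw <- as.table(matrix(c( 1, 7, 6,
                         5, 8, 4,
                        13, 1, 0),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(canopy     = c("None", "Partial", "Full"),
                                      wildflower = c("None", "Partial", "Full"))))
print(cw)

# Row and column totals; the grand total should equal the 45 sites.
print(addmargins(cw))
```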

Example 2 (continued)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  cw
## X-squared = 23.682, df = NA, p-value = 0.0004998
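  • The Chi Square statistic is 23.682. Because the p-value was estimated by simulation (2000 replicates), no degrees of freedom are reported (df = NA); the classical test for a 3x3 table would use four. The p-value is much smaller than 0.05, leading us to reject the null hypothesis. There is a significant association between canopy cover and wildflower cover.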
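
The output above is consistent with a call to chisq.test() using simulate.p.value = TRUE. The sketch below assumes the counts are entered by hand as cw; the seed is an arbitrary choice made here only so that the simulation is repeatable, and the simulated p-value can differ slightly with other seeds.

```r
# Canopy cover (rows) by wildflower cover (columns), counts of sites.
cw <- as.table(matrix(c( 1, 7, 6,
                         5, 8, 4,
                        13, 1, 0),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(canopy     = c("None", "Partial", "Full"),
                                      wildflower = c("None", "Partial", "Full"))))

# Expected counts under independence. Several are below 5, a common reason
# to prefer a simulated p-value over the chi-square approximation.
expected <- outer(rowSums(cw), colSums(cw)) / sum(cw)
print(round(expected, 2))

set.seed(1)  # assumption: arbitrary seed, used only for repeatability
res <- chisq.test(cw, simulate.p.value = TRUE, B = 2000)
print(res)
```

The statistic itself does not depend on the simulation; only the p-value does, and with 2000 replicates it cannot be smaller than 1/2001.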

Canopy Cover and Wildflower Cover

Canopy Cover, Soil Quality, Flower Species, Wildflower Cover

With additional data on soil quality and flower species, it is possible to visualize the conditions in which wildflowers are found. Opacity has been lowered so that overlapping points appear darker, indicating greater frequencies.

Code for 3D Scatterplot

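library(plotly)

# Combine the site-level vectors (assumed to exist already) into one data frame
visual3 <- data.frame(canopy, wildflower, soil, species)

# 3D scatterplot: one marker per site, colored by species; low opacity
# makes overlapping markers appear darker
ex3 <- plot_ly(visual3, type = "scatter3d", mode = "markers",
               x = ~soil, y = ~canopy, z = ~wildflower,
               color = ~species, colors = c('#9172EC', '#FFE078'),
               opacity = 0.3)

# Label the three axes
ex3 <- ex3 %>% layout(scene = list(
  xaxis = list(title = 'Soil Quality'),
  yaxis = list(title = 'Canopy Cover'),
  zaxis = list(title = 'Wildflower Cover')))
ex3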