Taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R

Chi-square contingency analysis

The data

We are going to analyse some data on counts of ladybirds found in an industrial and in a rural location. Some of the ladybirds are red and some are black. We would like to test whether there is an association between ladybird colour and location. In particular: is some feature of the habitat associated with the frequencies of the different colourings?

We will use a chi-square contingency analysis to do this.

Hypotheses

What do you think a suitable hypothesis should be for this investigation, and what would the corresponding null hypothesis be?

Preliminaries

  • Create an R notebook called ladybirds and save it in the scripts folder within your R project folder on your machine. You will find that it is saved as ladybirds.Rmd. Leave the YAML bit at the top, between the pair of lines with three dashes (or maybe add your name and the date), then delete everything else.
  • Save the data file “ladybirds_morph_colour.csv” to the data folder in your R project folder, if it is not already there.
  • Make sure that your Project folder is actually what RStudio recognises as a “project”. You can navigate to and around it in the Files tab in the bottom right-hand pane of RStudio. If all is good you will see a MyProjectName.Rproj file at the top level of the project. If you don’t see this, time now to turn your folder into a project!

Load packages we need

library(tidyverse)
library(here)
library(cowplot) # this makes your plots look nicer
theme_set(theme_cowplot()) # this sets the cowplot theme to be the default theme.

Import the data

filepath <- here("data", "ladybirds_morph_colour.csv")
lady <- read_csv(filepath)

Check it out

glimpse(lady)
Rows: 20
Columns: 4
$ Habitat      <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "Ru…
$ Site         <chr> "R1", "R2", "R3", "R4", "R5", "R1", "R2", "R3", "R4", "R5…
$ morph_colour <chr> "black", "black", "black", "black", "black", "red", "red"…
$ number       <dbl> 10, 3, 4, 7, 6, 15, 18, 9, 12, 16, 32, 25, 25, 17, 16, 17…

Are those sensible names for the variables? Is the data tidy?

Calculate the totals of each colour in each habitat.

totals <- lady %>%
  group_by(Habitat, morph_colour) %>%
  summarise(total.number = sum(number))
totals

Now that we have these totals, we use them to plot a bar chart of the data, using geom_col() within ggplot():

Plot the data

totals %>%
  ggplot(aes(x = Habitat, y = total.number, fill = morph_colour)) +
  geom_col(position = 'dodge') + # try leaving out the position = 'dodge' argument. What do you get?
  theme_cowplot() # try leaving out this line (but if you do, leave out the final '+' in the line above). What happens?

Fix the colours

The fill colours we got in the figure above are defaults from R, which does not realise that a factor of interest for us is the actual colour of the ladybirds. We would like the figure to reflect that, so let us make the bars red and black for red and black ladybirds respectively.

totals %>%
  ggplot(aes(x = Habitat, y = total.number, fill = morph_colour)) +
  geom_col(position = 'dodge') +
  scale_fill_manual(values = c(black = 'black', red = 'red')) # this line manually sets the colours for us

Interpret the graph before we do any ‘stats’

Look at the plot - does it look as though black ladybirds are equally as common, relative to the red, in the industrial as they are in rural settings, or not? Do you expect to retain or to reject the null hypothesis?

Making the Chi-square test

Preparation

We will use the command chisq.test() to carry out the chi-square test. However, this requires a matrix of the total counts, and our totals are currently in one column of a data frame. We need to convert this data frame into a 2x2 matrix. We can use the xtabs() command to do this.

lady.mat <- xtabs(number ~ Habitat + morph_colour, data = lady)
lady.mat
            morph_colour
Habitat      black red
  Industrial   115  85
  Rural         30  70

This matrix is sometimes called a contingency table.

The actual Chi-square test

This will calculate a ‘test statistic’ \(X^2\) by comparing the actual counts of the ladybirds with their expected counts under the null hypothesis. The further the actual counts are from their expected values, on the whole, the bigger this test statistic will be.

Providing a number of conditions are met by the data, this test statistic \(X^2\) has a so-called chi-squared distribution. This is a known mathematical distribution, which makes it possible to calculate the probability that the statistic would be as big as it is, or bigger, if the null hypothesis were true. We call this probability the p-value.

This is generally how statistical tests work. They take your data and use it in a carefully chosen way to calculate a number called a test statistic (it goes by different names in particular tests). How it is calculated depends on the test. Providing the data meet certain criteria, the statistic will typically have a known probability distribution. This means that the probability that it will exceed a given value, if the null hypothesis is true, can be calculated. This probability is the p-value. If the p-value is very small, and by that we typically mean less than 0.05 or 0.01, then we can reject the null hypothesis.
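To make this concrete, here is a one-line sketch in R of how a p-value comes from a test statistic and its distribution. The numbers are placeholders (they happen to anticipate the test we run below); pchisq() gives probabilities for the chi-squared distribution:

# probability of a chi-squared value of 19.1 or bigger, with 1 degree
# of freedom, if the null hypothesis were true
pchisq(19.1, df = 1, lower.tail = FALSE)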

Let’s do it…

chisq.test(lady.mat)

    Pearson's Chi-squared test with Yates' continuity correction

data:  lady.mat
X-squared = 19.103, df = 1, p-value = 1.239e-05

Yates continuity correction

This adjusts for the fact that our data are discrete and the chi-square distribution that we are using to calculate the p-values is continuous. That’s it.
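If you are curious, chisq.test() lets you switch the correction off with its correct argument (which is TRUE by default), so you can see the uncorrected statistic for yourself:

chisq.test(lady.mat, correct = FALSE) # no continuity correction; X-squared comes out a little larger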

Conclusion

Study the output of the chi-square test. Note that you are given a test statistic (here called X-squared) and a number of degrees of freedom (df); this is the number of independent pieces of information used to calculate the test statistic (in some tests you are given more than one of these). Lastly, you are given a p-value. This is the probability that you would have got a chi-squared value as big as, or bigger than, the one you got if the null hypothesis were true. Here the null hypothesis is that there is no association between ladybird colour and location. Put another way, the p-value is, roughly speaking, the probability of getting the data you got if the null hypothesis were true.

Select which of the two following statements would be an appropriate way to report these data, and fill in the missing values.

Option 1

‘Ladybird colour morphs are not equally common in the two habitats (Chi-sq = 19.1, df = 1, p < 0.001)’

Option 2

‘We find insufficient evidence to reject the null hypothesis that Ladybird colour morphs are equally distributed in the two habitats (Chi-sq =, df = , p = )’

The Chi-Square Test explained (advanced)

You can skip this section if you are not interested in how the chi-square test works. If you are, read on.

Let’s recall the number of sightings of each colour ladybird in each habitat:

lady.mat
            morph_colour
Habitat      black red
  Industrial   115  85
  Rural         30  70

If there were no association between colour and habitat, then we would expect the relative proportions of the colours to be independent of habitat. That is, they should be the same in both rural and industrial habitats. Looking at the table above, we see that two thirds (200 out of 300) of all sightings, regardless of colour, were in an industrial habitat. Hence we would expect two thirds of all 145 black sightings to be in an industrial habitat. We thus arrive at an ‘expected’ number of sightings of black ladybirds in an industrial habitat of \((200/300) \times 145 = 96.7\). Similarly, we would expect one third (100 out of 300) of all 155 red sightings to be in a rural habitat, giving an ‘expected’ number for this combination of levels of \((100/300)\times 155 = 51.7\). More generally, the expected number in any cell of the table, under the null hypothesis of no association between the two factors, is given by

\[\text{expected number}=\frac{\text{row total}\times\text{column total}}{\text{grand total}}\]

Using this method, these are the four expected numbers for each combination of levels of the two factors we have:

            morph_colour
Habitat      black    red
  Industrial 96.67 103.33
  Rural      48.33  51.67
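Rather than working these out by hand, we can let R do it for us. Here is one way, using the row and column totals of lady.mat; the expected counts are also returned by chisq.test() itself:

# expected count = row total x column total / grand total, for all cells at once
expected <- outer(rowSums(lady.mat), colSums(lady.mat)) / sum(lady.mat)
expected

# the same values live inside the test object
chisq.test(lady.mat)$expected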

To get a measurement of how far the actual table numbers are from their expected values we can, for each cell, square the difference between the observed and expected values, then divide the result by the expected value, and finally sum the four results that we get. Squaring gives equal weight to positive and negative deviations of the observed values from the expected values, and dividing by the expected value scales each squared deviation so that cells with large expected counts do not swamp the sum. The result is the chi-squared test statistic:

\[ \begin{align*} X^2&=\sum_{i=1}^4\frac{(O_i-E_i)^2}{E_i}\\ &=\frac{(115-96.67)^2}{96.67} + \frac{(85-103.33)^2}{103.33} + \frac{(30-48.33)^2}{48.33} + \frac{(70-51.67)^2}{51.67}\\ &=20.189 \end{align*} \]
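We can check this calculation in R, reusing the expected values computed above (those are unrounded, so the answer differs very slightly from the hand calculation):

sum((lady.mat - expected)^2 / expected) # about 20.19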

Yates continuity correction (digression)

You may have noticed that the \(X^2 = 20.189\) value calculated above is slightly larger than the value calculated by the chisq.test() function, which found \(X^2 = 19.103\). This is because the function uses Yates’ continuity correction. This was suggested by Yates in 1934 to correct for the fact that the \(X^2\) statistic we calculate is actually discrete (because we have categorical data) whereas the chi-squared distribution is continuous. This means that the value we calculate tends to be too big, so that our p-values are too small. The problem is most apparent where we have small numbers and one degree of freedom, which is what you have in a \(2 \times 2\) contingency table such as in the example above. Yates’ fix is quite simple: just amend the \(X^2\) statistic to the following:

\[X^2=\sum_{i=1}^4\frac{(\lvert O_i-E_i \rvert - 0.5)^2}{E_i}\]

where the vertical lines mean ‘take the absolute value of’, so that |85-103.33| = |-18.33| = 18.33

This gives us

\[ \begin{align*} X^2&=\sum_{i=1}^4\frac{(\lvert O_i-E_i \rvert - 0.5)^2}{E_i}\\ &=\frac{(|115-96.67|-0.5)^2}{96.67} + \frac{(|85-103.33|-0.5)^2}{103.33} + \frac{(|30-48.33|-0.5)^2}{48.33} + \frac{(|70-51.67|-0.5)^2}{51.67}\\ &=19.096 \end{align*} \] which agrees with the chisq.test() result; the small difference from the reported 19.103 is just down to rounding the expected values to two decimal places.
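Indeed, the corrected calculation in R, using the unrounded expected values from earlier, reproduces the function’s value exactly:

sum((abs(lady.mat - expected) - 0.5)^2 / expected) # 19.103, as chisq.test() reports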

Under a null hypothesis of no association between habitat and colour all the counts would be the ‘expected’ values, and \(X^2\) would be zero. The further away from these values the actual results are, the bigger \(X^2\) will be and the more likely it is that we can reject the null. For a sufficiently large value of \(X^2\) we will reject the null. To sum up, in general we will reject the null when \(X^2\) is large, and fail to reject it when it is small.

But how large is large enough to reject the null?

To answer this we use the fact that the sampling distribution of \(X^2\) is a chi-squared distribution with (in this case) \((2-1) \times (2-1) = 1\) degrees of freedom.
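To put a number on “large enough”: qchisq() is the inverse of pchisq(), so the cut-off that a chi-squared variable with 1 degree of freedom exceeds only 5% of the time is:

qchisq(0.95, df = 1) # about 3.84; any X-squared above this gives p < 0.05

Our observed value of 19.1 is far beyond this, which is why the p-value is so small.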

Why is that?

To answer this, let us recognise that in our observed contingency table, where we have N observations altogether, we can think of there being a probability \(P_i\) that any individual observation ends up in cell i, so that the observed frequency in that cell, \(O_i\), will on average be \(P_i\times N\). If we repeated the study again and again we would get slightly different numbers in each cell each time. The distribution of the numbers in a particular cell would follow similar rules to the number of heads we might get if we tossed a (biased) coin N times, then did the same again and again. This distribution is called a binomial distribution.

In other words, our observed frequencies have been obtained by sampling from a binomial distribution, where \(O_i \sim \text{Binomial}(N, P_i)\). Now, providing N is large enough, and providing too that the probabilities \(P_i\) are not too close to 0 or 1, the binomial distribution resembles a normal distribution. Thus, providing the expected counts \(E_i = P_i\times N\) are large enough, the \(O_i\) will be approximately normally distributed. If this is the case, then so too is \(\frac{(O_i-E_i)}{\sqrt{E_i}}\), since the expected values \(E_i\) are fixed quantities and all this transformation does is turn our normal distribution into (approximately) a standard normal distribution, with mean 0 and standard deviation 1, i.e. it shifts and squishes the distribution.
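If you would like to see this normal approximation for yourself, a quick simulation sketch helps. The cell probability used here is just the null-hypothesis value for the industrial/black cell, 96.67 out of 300:

# 10,000 simulated counts for one cell: 300 sightings, cell probability ~0.32
sim_counts <- rbinom(10000, size = 300, prob = 96.67/300)
hist(sim_counts) # roughly bell-shaped, i.e. approximately normal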

Hence in our expression for the test statistic \(X^2\), what we are doing is adding up k (= 4 in this case) squared, approximately standard normal quantities. A sum of k independent squared standard normal variables is, by definition, chi-squared distributed with k degrees of freedom. Thus we see that the sampling distribution of our test statistic is a chi-squared distribution.
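You can also simulate this definition directly. The sketch below sums the squares of four independent standard normal draws many times and compares the histogram with the chi-squared density for 4 degrees of freedom (note the word independent; the next paragraph explains why our four terms are not, and why the degrees of freedom drop):

# 10,000 sums of four squared standard normal draws
sim_chisq <- replicate(10000, sum(rnorm(4)^2))
hist(sim_chisq, freq = FALSE)
curve(dchisq(x, df = 4), add = TRUE) # overlays the chi-squared density with df = 4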

The one final slightly odd detail is that when we run a chi-square test for a 2 x 2 contingency table we are told that there is one degree of freedom, not four. This is because the four cell counts are not independent: once the row and column totals are fixed, filling in any one cell determines the other three, so only one piece of information is free to vary. In general, for an \(m \times n\) table, there will be \((m-1)\times(n-1)\) degrees of freedom.