Taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R

Chi-square contingency analysis

The data

We are going to analyse some data on counts of ladybirds found in an industrial and in a rural location. Some of the ladybirds are red and some are black. We would like to test for whether there is an association between the ladybird colour and its location. In particular: is some feature of the habitat associated with the frequencies of the different colourings?

We will use a Chi-Square contingency analysis to do this

Hypotheses

What do you think a suitable hypothesis should be for this investigation, and what would the corresponding null hypothesis be?

Preliminaries

  • Create an R notebook called ladybirds and save it in the scripts folder within your R project folder on your machine. You will find that it is saved as ladybirds.Rmd. Leave the yaml bit at the top, between the pairs of lines with three dashes (or maybe add your name and the date), then delete everything else.
  • Save the data file “ladybirds_morph_colour.csv” to the data folder in your R project folder, if is not already there.
  • Make sure that your Project folder is actually what RStudio recognises as a “project”. You can navigate to and around it in the Files tab in the bottom right-hand pane of RStudio. If all is good you will see a MyProjectName.Rproj file at the top level of the project. If you don’t see this, time now to turn your folder into a project!

Load packages we need

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Users/michaelhunt/Library/CloudStorage/OneDrive-CornwallCollege/CCG_courses/Level 6 CCG/HonoursProject/R_workshops
library(cowplot) # this makes your plots look nicer

Import the data

filepath<-here("data","ladybirds_morph_colour.csv")
lady<-read_csv(filepath)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Habitat, Site, morph_colour
## dbl (1): number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check it out

glimpse(lady)
## Rows: 20
## Columns: 4
## $ Habitat      <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "Ru…
## $ Site         <chr> "R1", "R2", "R3", "R4", "R5", "R1", "R2", "R3", "R4", "R5…
## $ morph_colour <chr> "black", "black", "black", "black", "black", "red", "red"…
## $ number       <dbl> 10, 3, 4, 7, 6, 15, 18, 9, 12, 16, 32, 25, 25, 17, 16, 17…

Are those sensible names for the variables? Is the data tidy?

Calculate the totals of each colour in each habitat.

totals<- lady %>%
  group_by(Habitat,morph_colour) %>%
  summarise (total.number = sum(number))
## `summarise()` has grouped output by 'Habitat'. You can override using the
## `.groups` argument.
totals

Plot the data

totals %>%
  ggplot(aes(x = Habitat,y = total.number,fill=morph_colour))+
  geom_col(position='dodge') + #try leaving ou this line. What do you get?
  theme_cowplot() # try leaving out this line (but if you do, leave out the final '+' in the line above). What happens?

Fix the colours

Let us make the bars red and black for red and black ladybirds respectively.

totals %>%
  ggplot(aes(x = Habitat,y = total.number,fill=morph_colour))+
  geom_col(position='dodge') +
  scale_fill_manual(values=c(black='black',red='red')) +
  theme_cowplot()

Interpret the graph before we do any ‘stats’

Look at the plot - does it look as though black ladybirds are equally as common, relative to the red, in the industrial as they are in rural settings, or not? Do you expect to retain or to reject the null hypothesis?

Making the Chi-square test

Preparation

We will use the command chisq.test() for this. However, this requires a matrix of the total counts and our totals data is in one column of a data frame. All dplyr() commands work on and give back data frames. We need to convert this data frame into a 2x2 matrix. We can use the xtabs() command to do this.

lady.mat<-xtabs(number~Habitat + morph_colour, data=lady)
lady.mat
##             morph_colour
## Habitat      black red
##   Industrial   115  85
##   Rural         30  70

The argument number~Habitat + morph_colour tells the xtabs() function to look trough the lady data frame and find how many ladybirds there are for each combination of habitat and colour. So, for example, there are 115 black ladybirds in an industrial location, 70 red ones in a rural location, and so on.

The resulting matrix is sometimes called a contingency table.

The actual Chi-square test

This will calculate a ‘test statistic’ by comparing the actual counts of the ladybirds with their expected counts under the null hypothesis.

This test statistic has a so-called chi-squared distribution. The p-value returned is the probability that the statistic would be as big as it is, or bigger, if the null hypothesis were true.

This is generally how statistical tests work. They take your data and use it in a carefully chosen way to calculate some number that in general is called a test statistic but which is referred to by different names when calculated for different tests. How it is calculated depends on the test. Providing the data meet certain criteria, the statistic will typically have a probability distribution. This means that the probability that it will exceed a given value, if the null hypothesis is true, can be calculated. This probability is the p-value. If the p-value is very small, and by that we typically mean less than 0.05 or 0.01, then we can reject the null hypothesis.

A good explanation of how this works with contingency tables can be found here: http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm

How the chi-square test works:

A chi square test is appropriate provided we have count data, as we do here, where we do not have replicates ie in this case there is one count number per combination of location and colour, and where the counts are independent ie each ladybird contributes to just one of the counts, which is also the case here.

Let’s do it…

chisq.test(lady.mat)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lady.mat
## X-squared = 19.103, df = 1, p-value = 1.239e-05

Conclusion

Study the output of the chi-square test. Note that you are given a test-statistic (here called Chi-squared), a number of degrees of freedom (in some tests you are given more than one of these). This is the number of independent pieces of information used to calculate the test statistic. Lastly, you are a given a p-value. This is the probability that you would have got a chi-squared value as big or bigger than the one you got if the null hypothesis were true. Here the null hypothesis is that there is no association between ladybird colour and location. Put another way, it is, roughly speaking, the probability of getting the data you got if the null hypothesis were true.

In our case, if there were no association of ladybird colour with location, we would expect to find roughly the same proportion of each colour in each location. 300 ladybirds were observed altogether, 200 of them in the industrial location, with about half overall being red and half black. Hence, under the null hypothesis that there is no association of ladybird colour with location we would ‘expect’ to see 100 of each colour in the industrial location and 50 of each colour in the rural location.

The chi-square test essentially calculates the likelihood that we would have observed the numbers we actually saw if that null hypothesis were true and there really were 50% each of red and black ladybirds in each of the locations.

Select which of the two following statements would be an appropriate way to report these data, and fill in the missing values.

Option 1

‘Ladybird colour morphs are not equally common in the two habitats (Chi-sq =19.3, df = 1, p<0.001)’

Option 2

‘We find insufficient evidence to reject the null hypothesis that Ladybird colour morphs are equally distributed in the two habitats (Chi-sq =, df = , p = )’