Taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R

Chi-square contingency analysis

The data

We are going to analyse some data on counts of ladybirds found in an industrial and in a rural location. Some of the ladybirds are red and some are black. We would like to test whether there is an association between ladybird colour and location. In particular: is some feature of the habitat associated with the frequencies of the different colourings?

We will use a Chi-square contingency analysis to do this.

Hypotheses

The null hypothesis is: H0:
The alternative hypothesis is: H1:

Preliminaries

Create a script file called ladybirds.R in your R_stuff folder on your machine.
Save the data file “ladybirds_morph_colour.csv” to the data folder in your R_stuff folder, if it is not already there.
Make your R_stuff folder a “Project”. This will be good practice in future. In particular, it will enable you to use the very useful here package, which makes it easy both to organise your files and to get at them from R without having to remember strange prefixes to your filenames like “./” (or should it be “../”?).

Clear the decks

rm(list=ls())

Load packages we need

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# if trying that causes you problems, uncomment and try these lines
# library(dplyr)
# library(ggplot2)
# library(readr)
library(here)

## here() starts at /Users/mbh/Documents/CCG_courses/Level_5_CCG/CORN276_GIS_RM/RStuff

Import the data

#lady<-read_csv("../data/ladybirds_morph_colour.csv")
lady<-read_csv(here("data","ladybirds_morph_colour.csv"))

## Parsed with column specification:
## cols(
##   Habitat = col_character(),
##   Site = col_character(),
##   morph_colour = col_character(),
##   number = col_double()
## )
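To see what here() contributes to the import line above, it can help to look at the path it builds. This is a minimal sketch, assuming the here package is installed and the script is run from inside the project; the name csv_path is only illustrative.

```r
library(here)

# here() joins its arguments onto the project root, so the same call
# works from any sub-folder of the project.
csv_path <- here("data", "ladybirds_morph_colour.csv")
csv_path

# Check that the file really is there before trying to read it
file.exists(csv_path)
```

If file.exists() returns FALSE, the usual causes are that the file is not in the data folder or that R was not started from within the project.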

Check it out

glimpse(lady)

## Rows: 20
## Columns: 4
## $ Habitat      <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
## $ Site         <chr> "R1", "R2", "R3", "R4", "R5", "R1", "R2", "R3", "R4", "R…
## $ morph_colour <chr> "black", "black", "black", "black", "black", "red", "red…
## $ number       <dbl> 10, 3, 4, 7, 6, 15, 18, 9, 12, 16, 32, 25, 25, 17, 16, 1…

Are those sensible names for the variables? Is the data tidy?

Calculate the totals of each colour in each habitat.

totals <- lady %>%
  group_by(Habitat, morph_colour) %>%
  summarise(total.number = sum(number))

## `summarise()` regrouping output by 'Habitat' (override with `.groups` argument)

totals
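The regrouping message is harmless, but it can be silenced by telling summarise() what to do with the groups. The sketch below shows this on the ten Rural rows only, typed in from the glimpse() output above. It assumes the truncated tenth Site value is "R5" and that rows 8 to 10 are also Rural; the Industrial rows are left out because their counts are cut off in that output. The names rural and rural_totals are illustrative.

```r
library(dplyr)
library(tibble)

# The ten Rural rows, as far as they can be read off the glimpse() output
# (the tenth Site value is assumed to be "R5")
rural <- tibble(
  Habitat      = rep("Rural", 10),
  Site         = rep(c("R1", "R2", "R3", "R4", "R5"), times = 2),
  morph_colour = rep(c("black", "red"), each = 5),
  number       = c(10, 3, 4, 7, 6, 15, 18, 9, 12, 16)
)

# .groups = "drop" returns an ungrouped data frame and suppresses the message
rural_totals <- rural %>%
  group_by(Habitat, morph_colour) %>%
  summarise(total.number = sum(number), .groups = "drop")

rural_totals
```

The two totals here, 30 black and 70 red, should agree with the Rural row of the contingency table further down.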

Plot the data

g<-ggplot(totals, aes(x = Habitat,y = total.number,fill=morph_colour))+
  geom_bar(stat='identity',position='dodge')
g

Fix the colours

Let us make the bars red and black for red and black ladybirds respectively.

g<-ggplot(totals, aes(x = Habitat,y = total.number,fill=morph_colour))+
  geom_bar(stat='identity',position='dodge')+
  scale_fill_manual(values=c(black='black',red='red'))
g

Interpret the graph before we do any ‘stats’

Look at the plot. Does it look as though black ladybirds are as common, relative to the red, in the industrial setting as they are in the rural setting, or not? Do you expect to retain or to reject the null hypothesis?

Making the Chi-square test

Preparation

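We will use the function chisq.test() for this. However, it requires a matrix of the total counts, whereas our totals are held in a single column of a data frame, because dplyr functions take and return data frames. We need to convert the data into a 2x2 matrix. The xtabs() function does this: given the formula number ~ Habitat + morph_colour, it sums number over the sites for each combination of habitat and colour, so we can apply it directly to the original lady data frame.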

lady.mat<-xtabs(number~Habitat + morph_colour,data=lady)
lady.mat

##             morph_colour
## Habitat      black red
##   Industrial   115  85
##   Rural         30  70

This matrix is sometimes called a contingency table.
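The table is easier to compare across habitats as row proportions, since the two habitats have different total counts. This is a small sketch in base R that retypes the counts from the table printed above, so that it runs without the data file; the names lady_by_hand and row_props are illustrative.

```r
# Counts copied from the contingency table printed above
lady_by_hand <- matrix(
  c(115, 85,
     30, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(Habitat      = c("Industrial", "Rural"),
                  morph_colour = c("black", "red"))
)

# Proportion of each colour within each habitat (each row sums to 1)
row_props <- prop.table(lady_by_hand, margin = 1)
round(row_props, 3)
```

With the real data loaded, prop.table(lady.mat, margin = 1) should give the same proportions.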

The actual Chi-square test

The test calculates a ‘test statistic’ by comparing the observed counts of ladybirds with the counts we would expect under the null hypothesis. If the null hypothesis is true, this statistic approximately follows a chi-squared distribution. The p-value returned is the probability that the statistic would be as big as it is, or bigger, if the null hypothesis were true.

A good explanation of how this works with contingency tables can be found here: http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm

Let’s do it…

chisq.test(lady.mat)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  lady.mat
## X-squared = 19.103, df = 1, p-value = 1.239e-05
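To see where these numbers come from, the statistic can be rebuilt by hand. This is a sketch in base R that uses the counts from the contingency table above, retyped so that it is self-contained. The variable names are illustrative, and the 0.5 subtracted from each difference is Yates' continuity correction, which chisq.test() applies by default to 2x2 tables.

```r
# Observed counts, copied from the contingency table
observed <- matrix(
  c(115, 85,
     30, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(Habitat      = c("Industrial", "Rural"),
                  morph_colour = c("black", "red"))
)

# Expected counts if colour and habitat were independent:
# (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic with Yates' continuity correction
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) x (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Probability of a statistic at least this large under the null hypothesis
p_value <- pchisq(yates_stat, df = deg_free, lower.tail = FALSE)

round(yates_stat, 3)
deg_free
signif(p_value, 4)
```

These three values should agree with the X-squared, df and p-value reported by chisq.test() above. Calling chisq.test(observed, correct = FALSE) runs the same test without the continuity correction, which gives a slightly larger statistic.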

Conclusion

Study the output of the chi-square test.

Select which of the two following statements would be an appropriate way to report these data, and fill in the missing values.

Option 1

‘Ladybird colour morphs are not equally common in the two habitats (Chi-sq = , df = , p = )’

Option 2

‘We find insufficient evidence to reject the null hypothesis that ladybird colour morphs are equally distributed in the two habitats (Chi-sq = , df = , p = )’