Taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R
We are going to analyse some data on counts of ladybirds found in an industrial and in a rural location. Some of the ladybirds are red and some are black. We would like to test for whether there is an association between the ladybird colour and its location. In particular: is some feature of the habitat associated with the frequencies of the different colourings?
We will use a Chi-Square contingency analysis to do this
What do you think a suitable hypothesis should be for this investigation, and what would the corresponding null hypothesis be?
ladybirds and save it in
the scripts folder within your R project folder on your machine. You
will find that it is saved as ladybirds.Rmd. Leave the
yaml bit at the top, between the pairs of lines with three
dashes (or maybe add your name and the date), then delete everything
else.library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at /Users/michaelhunt/Library/CloudStorage/OneDrive-CornwallCollege/CCG_courses/Level 6 CCG/HonoursProject/R_workshops
library(cowplot) # this makes your plots look nicer
filepath<-here("data","ladybirds_morph_colour.csv")
lady<-read_csv(filepath)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Habitat, Site, morph_colour
## dbl (1): number
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(lady)
## Rows: 20
## Columns: 4
## $ Habitat <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "Ru…
## $ Site <chr> "R1", "R2", "R3", "R4", "R5", "R1", "R2", "R3", "R4", "R5…
## $ morph_colour <chr> "black", "black", "black", "black", "black", "red", "red"…
## $ number <dbl> 10, 3, 4, 7, 6, 15, 18, 9, 12, 16, 32, 25, 25, 17, 16, 17…
Are those sensible names for the variables? Is the data tidy?
totals<- lady %>%
group_by(Habitat,morph_colour) %>%
summarise (total.number = sum(number))
## `summarise()` has grouped output by 'Habitat'. You can override using the
## `.groups` argument.
totals
totals %>%
ggplot(aes(x = Habitat,y = total.number,fill=morph_colour))+
geom_col(position='dodge') + #try leaving ou this line. What do you get?
theme_cowplot() # try leaving out this line (but if you do, leave out the final '+' in the line above). What happens?
Let us make the bars red and black for red and black ladybirds respectively.
totals %>%
ggplot(aes(x = Habitat,y = total.number,fill=morph_colour))+
geom_col(position='dodge') +
scale_fill_manual(values=c(black='black',red='red')) +
theme_cowplot()
Look at the plot - does it look as though black ladybirds are equally as common, relative to the red, in the industrial as they are in rural settings, or not? Do you expect to retain or to reject the null hypothesis?
We will use the command chisq.test() for this. However,
this requires a matrix of the total counts and our totals data is in one
column of a data frame. All dplyr() commands work on and
give back data frames. We need to convert this data frame into a 2x2
matrix. We can use the xtabs() command to do this.
lady.mat<-xtabs(number~Habitat + morph_colour, data=lady)
lady.mat
## morph_colour
## Habitat black red
## Industrial 115 85
## Rural 30 70
The argument number~Habitat + morph_colour tells the
xtabs() function to look trough the lady data
frame and find how many ladybirds there are for each combination of
habitat and colour. So, for example, there are 115 black ladybirds in an
industrial location, 70 red ones in a rural location, and so on.
The resulting matrix is sometimes called a contingency table.
This will calculate a ‘test statistic’ by comparing the actual counts of the ladybirds with their expected counts under the null hypothesis.
This test statistic has a so-called chi-squared distribution. The p-value returned is the probability that the statistic would be as big as it is, or bigger, if the null hypothesis were true.
This is generally how statistical tests work. They take your data and use it in a carefully chosen way to calculate some number that in general is called a test statistic but which is referred to by different names when calculated for different tests. How it is calculated depends on the test. Providing the data meet certain criteria, the statistic will typically have a probability distribution. This means that the probability that it will exceed a given value, if the null hypothesis is true, can be calculated. This probability is the p-value. If the p-value is very small, and by that we typically mean less than 0.05 or 0.01, then we can reject the null hypothesis.
A good explanation of how this works with contingency tables can be found here: http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm
How the chi-square test works:
A chi square test is appropriate provided we have count data, as we do
here, where we do not have replicates ie in this case there is one count
number per combination of location and colour, and where the counts are
independent ie each ladybird contributes to just one of the counts,
which is also the case here.
Let’s do it…
chisq.test(lady.mat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lady.mat
## X-squared = 19.103, df = 1, p-value = 1.239e-05
Study the output of the chi-square test. Note that you are given a test-statistic (here called Chi-squared), a number of degrees of freedom (in some tests you are given more than one of these). This is the number of independent pieces of information used to calculate the test statistic. Lastly, you are a given a p-value. This is the probability that you would have got a chi-squared value as big or bigger than the one you got if the null hypothesis were true. Here the null hypothesis is that there is no association between ladybird colour and location. Put another way, it is, roughly speaking, the probability of getting the data you got if the null hypothesis were true.
In our case, if there were no association of ladybird colour with location, we would expect to find roughly the same proportion of each colour in each location. 300 ladybirds were observed altogether, 200 of them in the industrial location, with about half overall being red and half black. Hence, under the null hypothesis that there is no association of ladybird colour with location we would ‘expect’ to see 100 of each colour in the industrial location and 50 of each colour in the rural location.
The chi-square test essentially calculates the likelihood that we would have observed the numbers we actually saw if that null hypothesis were true and there really were 50% each of red and black ladybirds in each of the locations.
Select which of the two following statements would be an appropriate way to report these data, and fill in the missing values.
‘Ladybird colour morphs are not equally common in the two habitats (Chi-sq =19.3, df = 1, p<0.001)’
‘We find insufficient evidence to reject the null hypothesis that Ladybird colour morphs are equally distributed in the two habitats (Chi-sq =, df = , p = )’