Taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R
We are going to analyse some data on counts of ladybirds found in an industrial and in a rural location. Some of the ladybirds are red and some are black. We would like to test whether there is an association between ladybird colour and location. In particular: is some feature of the habitat associated with the frequencies of the different colourings?
We will use a Chi-Square contingency analysis to do this.
We load the tidyverse and also the here package, which makes it easy for you both to organise your files and to get at them from R without having to remember strange prefixes to your filenames like “./” (or should it be “../”?).
rm(list=ls())
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# if trying that causes you problems, uncomment and try these lines
# library(dplyr)
# library(ggplot2)
# library(readr)
library(here)
## here() starts at /Users/mbh/Documents/CCG_courses/Level_5_CCG/CORN276_GIS_RM/RStuff
#lady<-read_csv("../data/ladybirds_morph_colour.csv")
lady<-read_csv(here("data","ladybirds_morph_colour.csv"))
## Parsed with column specification:
## cols(
## Habitat = col_character(),
## Site = col_character(),
## morph_colour = col_character(),
## number = col_double()
## )
glimpse(lady)
## Rows: 20
## Columns: 4
## $ Habitat <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
## $ Site <chr> "R1", "R2", "R3", "R4", "R5", "R1", "R2", "R3", "R4", "R…
## $ morph_colour <chr> "black", "black", "black", "black", "black", "red", "red…
## $ number <dbl> 10, 3, 4, 7, 6, 15, 18, 9, 12, 16, 32, 25, 25, 17, 16, 1…
Are those sensible names for the variables? Is the data tidy?
totals<- lady %>%
group_by(Habitat,morph_colour) %>%
summarise (total.number = sum(number))
## `summarise()` regrouping output by 'Habitat' (override with `.groups` argument)
totals
g<-ggplot(totals, aes(x = Habitat,y = total.number,fill=morph_colour))+
geom_bar(stat='identity',position='dodge')
g
Let us make the bars red and black for red and black ladybirds respectively.
g<-ggplot(totals, aes(x = Habitat,y = total.number,fill=morph_colour))+
geom_bar(stat='identity',position='dodge')+
scale_fill_manual(values=c(black='black',red='red'))
g
Look at the plot. Does it look as though black ladybirds are as common, relative to red ones, in the industrial setting as they are in the rural setting, or not? Do you expect to retain or to reject the null hypothesis?
We will use the command chisq.test() for this. However, it requires a matrix of the total counts, whereas our totals are held in one column of a data frame. All dplyr commands work on data frames and give back data frames, so we need to convert the data into a 2x2 matrix ourselves. We can use the xtabs() command to do this. xtabs() sums the number column within each combination of Habitat and morph_colour, so we can give it the original lady data directly.
lady.mat<-xtabs(number~Habitat + morph_colour,data=lady)
lady.mat
## morph_colour
## Habitat black red
## Industrial 115 85
## Rural 30 70
This matrix is sometimes called a contingency table.
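As a cross-check, the same table can be built from the summed totals rather than from the raw rows, and converting it to row proportions puts numbers on what the bar chart shows. The sketch below is self-contained: instead of reading the data file, it retypes the four totals from the table above into a data frame called totals_df (a name chosen here purely for illustration).

```r
# Totals retyped from the contingency table above (illustrative data frame)
totals_df <- data.frame(
  Habitat      = c("Industrial", "Industrial", "Rural", "Rural"),
  morph_colour = c("black", "red", "black", "red"),
  total.number = c(115, 85, 30, 70)
)

# xtabs() sums the left-hand variable within each Habitat x colour cell
tab <- xtabs(total.number ~ Habitat + morph_colour, data = totals_df)
tab

# Proportion of each colour within each habitat (each row sums to 1)
props <- prop.table(tab, margin = 1)
round(props, 3)
```

If the two colours were distributed in the same way in both habitats, the two rows of props would be roughly equal.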
The chisq.test() command calculates a ‘test statistic’ by comparing the actual counts of the ladybirds with the counts we would expect under the null hypothesis of no association between colour and habitat. If the null hypothesis is true, this test statistic approximately follows a chi-squared distribution. The p-value returned is the probability that the statistic would be as big as it is, or bigger, if the null hypothesis were true.
A good explanation of how this works with contingency tables can be found here: http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm
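To make the calculation concrete, here is a minimal sketch that works through it by hand in base R, using the four counts from the contingency table above. The object names (observed, expected, x2_raw, x2_yates, dof, p_yates) are chosen here for illustration. For a 2x2 table chisq.test() applies Yates' continuity correction by default, so x2_yates is the value to compare with its output. The correction is written in its simple form (subtract 0.5 from each absolute difference), which assumes every difference is larger than 0.5, as is the case for these data.

```r
# Observed counts, retyped from the contingency table above
observed <- matrix(
  c(115, 30, 85, 70), nrow = 2,
  dimnames = list(Habitat = c("Industrial", "Rural"),
                  morph_colour = c("black", "red"))
)

# Expected count in each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Pearson statistic with no correction
x2_raw <- sum((observed - expected)^2 / expected)

# With Yates' continuity correction: shrink each |difference| by 0.5
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper tail of the chi-squared distribution
p_yates <- pchisq(x2_yates, df = dof, lower.tail = FALSE)

c(x2_raw = x2_raw, x2_yates = x2_yates, dof = dof, p = p_yates)
```

The expected counts are what we would see if colour were independent of habitat; the further the observed counts sit from them, the larger the statistic and the smaller the p-value.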
Let’s do it…
chisq.test(lady.mat)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lady.mat
## X-squared = 19.103, df = 1, p-value = 1.239e-05
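The values needed for a written report can also be pulled out of the test result rather than copied from the printout. The sketch below assumes the result is stored in an object called result (an illustrative name) and rebuilds the table from the four counts shown earlier so that it runs on its own. The components statistic, parameter, p.value and expected are part of the object that chisq.test() returns.

```r
# Observed counts, retyped from the contingency table above
observed <- matrix(
  c(115, 30, 85, 70), nrow = 2,
  dimnames = list(Habitat = c("Industrial", "Rural"),
                  morph_colour = c("black", "red"))
)

# Store the test result instead of just printing it
result <- chisq.test(observed)

result$statistic   # the X-squared value
result$parameter   # the degrees of freedom
result$p.value     # the p-value
result$expected    # expected counts under the null hypothesis

# Round for reporting
round(unname(result$statistic), 2)
signif(result$p.value, 3)
```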
Study the output of the chi-square test.
Select which of the two following statements would be an appropriate way to report these data, and fill in the missing values.
‘Ladybird colour morphs are not equally common in the two habitats (Chi-sq =, df = , p = )’
‘We find insufficient evidence to reject the null hypothesis that ladybird colour morphs are equally distributed in the two habitats (Chi-sq =, df = , p = )’