The following example covers Example 8.6 (Mass extinctions) in Whitlock & Schluter’s “Analysis of Biological Data” using the package dplyr and tidyr. To see how to work through this example using base R, click here.

Load the data

extinctData <- read_csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e6MassExtinctions.csv")

## Parsed with column specification:
## cols(
##   numberOfExtinctions = col_double()
## )

glimpse(extinctData)

## Rows: 76
## Columns: 1
## $ numberOfExtinctions <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…

Create table

The following commands create the table. Note that we are explicitly converting numberOfExtinctions to a factor - why we do this will become clearer in the pipeline after this one.

extinctTable <- extinctData %>% 
  mutate(nExtinctFactor = factor(numberOfExtinctions, levels=0:20)) %>%
  group_by(nExtinctFactor) %>%
  summarize(obs = n())

## `summarise()` ungrouping output (override with `.groups` argument)

glimpse(extinctTable)

## Rows: 14
## Columns: 2
## $ nExtinctFactor <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 20
## $ obs            <int> 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 1, 2, 1

Fill in gaps

Looking at the table, the problem we have is the non-existence of levels that have no counts, such as 0, 12, and 13. We will use the complete function from the tidyr package. The fill argument tells complete what to put for specified variables (rather than NA) for those levels that get filled in.

extinctTable <- extinctTable %>% 
  complete(nExtinctFactor, 
           fill = list(obs = 0))
glimpse(extinctTable)

## Rows: 21
## Columns: 2
## $ nExtinctFactor <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ obs            <dbl> 0, 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 2,…

That’s better. We don’t need the factor anymore (it will be replaced later anyways), so the next code will remove the factor and replace it with a numeric variable denoting number of extinctions.

extinctTable <- extinctTable %>%
  mutate(nExtinct = 0:20) %>%
  select(-nExtinctFactor)
glimpse(extinctTable)

## Rows: 21
## Columns: 2
## $ obs      <dbl> 0, 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 2, 0, 0,…
## $ nExtinct <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…

Calculate summary statistics

We need to grab some summary statistics from the table, in particular the total number of observations in our data, as well as the average number of extinctions (note that we can’t use the mean function as this is tabulated data, not raw data!) This would be a good time to think about how you would calculate averages from tabulated data - useful skill for when papers provide tables but no raw data.

summaryTable <- extinctTable %>%
  summarize(total = sum(obs),
            mean = sum(nExtinct*obs)/total)
glimpse(summaryTable)

## Rows: 1
## Columns: 2
## $ total <dbl> 76
## $ mean  <dbl> 4.210526

Compute expected frequencies

Okay, now let’s create the expected frequencies using the dpois function. Note that the Poisson distribution is parameterized by \(\lambda\), which is the average number of extinctions we just calculated. Also, we need the table to be absolute frequencies, not relative frequencies, so we multiply the probabilities by the total we just calculated in the summaryTable.

extinctTable <- extinctTable %>% 
  mutate(exp = summaryTable$total*dpois(0:20, lambda = summaryTable$mean))
print(extinctTable)

## # A tibble: 21 x 3
##      obs nExtinct   exp
##    <dbl>    <int> <dbl>
##  1     0        0  1.13
##  2    13        1  4.75
##  3    15        2 10.0 
##  4    16        3 14.0 
##  5     7        4 14.8 
##  6    10        5 12.4 
##  7     4        6  8.73
##  8     2        7  5.25
##  9     1        8  2.76
## 10     2        9  1.29
## # … with 11 more rows

Aggregate groups to satisfy assumptions

Looking at the expected frequencies, we notice that many of the frequencies are less than one, and some of them are less than five. Following the example given here, we will group 0 and 1 extinctions together, and 8 or more extinctions together. We will use the mutate command and utilize the case_when function to create our aggregated groups. Note the need for as.character to make sure the groups variable has type character. Then, we sum the probabilities within those aggregated groups.

extinctTable <- extinctTable %>% 
  mutate(groups = case_when(
      nExtinct <= 1 ~ "0 or 1",
      nExtinct >= 8 ~ "8 or more",
      TRUE ~ as.character(nExtinct)
    )
  ) %>%
  group_by(groups) %>%
  summarize(obs = sum(obs),
            exp = sum(exp))

## `summarise()` ungrouping output (override with `.groups` argument)

glimpse(extinctTable)

## Rows: 8
## Columns: 3
## $ groups <chr> "0 or 1", "2", "3", "4", "5", "6", "7", "8 or more"
## $ obs    <dbl> 13, 15, 16, 7, 10, 4, 2, 9
## $ exp    <dbl> 5.876068, 9.996501, 14.030177, 14.768608, 12.436722, 8.727524,…

Add in lost probability

The Poisson distribution goes to infinity (and beyond!), which we’ve effectively ignored as our data only went to 20. To see this, take a look at the sums of the observed and expected at this stage.

extinctTable %>%
  summarize(sum(obs), sum(exp))

## # A tibble: 1 x 2
##   `sum(obs)` `sum(exp)`
##        <dbl>      <dbl>
## 1         76       76.0

Hmmm, that’s weird - why is there a .0 for the sum of the expected? Printing tibbles will round at times, so let’s look at the deviation from 76 directly:

summaryTable$total - sum(extinctTable$exp)

## [1] 3.516995e-07

Ah, there we go. REALLY close to 76 but not quite. So, we need to add this missed probability in the tail of the Poisson distribution back in. We do that now by adding the missing frequencies to the last group.

extinctTable[8, "exp"] <- extinctTable[8, "exp"] + (summaryTable$total - sum(extinctTable$exp))
glimpse(extinctTable)

## Rows: 8
## Columns: 3
## $ groups <chr> "0 or 1", "2", "3", "4", "5", "6", "7", "8 or more"
## $ obs    <dbl> 13, 15, 16, 7, 10, 4, 2, 9
## $ exp    <dbl> 5.876068, 9.996501, 14.030177, 14.768608, 12.436722, 8.727524,…

Let’s do some quick data validation to make sure the sum of the observed and expected frequencies are now the same (this time subtracting from the total to see if we get zero):

extinctTable %>%
  summarize(summaryTable$total - sum(obs), summaryTable$total - sum(exp))

## # A tibble: 1 x 2
##   `summaryTable$total - sum(obs)` `summaryTable$total - sum(exp)`
##                             <dbl>                           <dbl>
## 1                               0                               0

Yay! Okay, we’re ready for the chi-squared goodness-of-fit test.

Perform chi-squared goodness-of-fit test

Finally, we perform the \(\chi^2\) goodness-of-fit test.

saveChiTest <- chisq.test(extinctTable$obs, p = extinctTable$exp/summaryTable$total)

## Warning in chisq.test(extinctTable$obs, p = extinctTable$exp/
## summaryTable$total): Chi-squared approximation may be incorrect

Note that the chisq.test function assumes the degrees of freedom are number of categories minus one, but we lost one more degree of freedom since we had to estimate the mean of the Poisson distribution from the data. Since our number of categories is 8, we have a degree of freedom given by \(df = 8 - 1 - 1 = 6\). Thus, we will calculate the \(P\)-value manually using pchisq, but use the already calculated test statistic from the chisq.test function (which is why we saved the result in saveChiTest).

pValue <- pchisq(q = saveChiTest$statistic, 
                 df = 6,
                 lower.tail = FALSE)
unname(pValue)

## [1] 0.0005334919

So what’s our conclusion? We reject the null hypothesis that extinctions in the fossil record follow a Poisson distribution.

Mass extinctions with dplyr and tidyr

M. Drew LaMar

September 29, 2020