The following example covers Example 8.6 (Mass extinctions) in Whitlock & Schluter’s “Analysis of Biological Data” using the package dplyr and tidyr. To see how to work through this example using base R, click here.

Load the data

read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e6MassExtinctions.csv") %>% 
  tbl_df() -> 
  extinctData
glimpse(extinctData)

## Observations: 76
## Variables: 1
## $ numberOfExtinctions <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...

Create table

The following commands create the table.

extinctData %>% 
  mutate(nExtinctFactor = factor(numberOfExtinctions, levels=0:20)) %>%
  group_by(nExtinctFactor) %>%
  summarize(obs = n()) ->
  extinctTable
glimpse(extinctTable)

## Observations: 14
## Variables: 2
## $ nExtinctFactor <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 20
## $ obs            <int> 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 1, 2, 1

Fill in gaps

Looking at the table, the problem we have is the non-existence of levels that have no counts, such as 0, 12, and 13. We will use the complete function from the tidyr package. The fill argument tells complete what to put for specified variables (rather than NA) for those levels that get filled in.

extinctTable %>% 
  complete(nExtinctFactor, 
           fill = list(obs = 0)) ->
  extinctTable
glimpse(extinctTable)

## Observations: 21
## Variables: 2
## $ nExtinctFactor <fctr> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...
## $ obs            <dbl> 0, 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 0, 0, 1...

That’s better. We don’t need the factor anymore (it will be replaced later anyways), so the next code will remove the factor and replace it with a numeric variable denoting number of extinctions.

extinctTable %>%
  mutate(nExtinct = 0:20) %>%
  select(-nExtinctFactor) ->
  extinctTable
glimpse(extinctTable)

## Observations: 21
## Variables: 2
## $ obs      <dbl> 0, 13, 15, 16, 7, 10, 4, 2, 1, 2, 1, 1, 0, 0, 1, 0, 2...
## $ nExtinct <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...

Calculate summary statistics

We need to grab some summary statistics from the table, in particular the total number of observations in our data, as well as the average number of extinctions (note that we can’t use the mean function as this is tabulated data, not raw data!) This would be a good time to think about how you would calculate averages from tabulated data - useful skill for when papers provide tables but no raw data.

extinctTable %>%
  summarize(total = sum(obs),
            mean = sum(nExtinct*obs)/total) ->
  summaryTable
glimpse(summaryTable)

## Observations: 1
## Variables: 2
## $ total <dbl> 76
## $ mean  <dbl> 4.210526

Compute expected frequencies

Okay, now let’s create the expected frequencies using the dpois function. Note that the Poisson distribution is parameterized by \(\lambda\), which is the average number of extinctions we just calculated. Also, we need the table to be absolute frequencies, not relative frequencies, so we multiply the probabilities by the total we just calculated in the summaryTable.

extinctTable %>% 
  mutate(exp = summaryTable$total*dpois(0:20, lambda = summaryTable$mean)) ->
  extinctTable
print(extinctTable)

## # A tibble: 21 × 3
##      obs nExtinct       exp
##    <dbl>    <int>     <dbl>
## 1      0        0  1.127730
## 2     13        1  4.748338
## 3     15        2  9.996501
## 4     16        3 14.030177
## 5      7        4 14.768608
## 6     10        5 12.436722
## 7      4        6  8.727524
## 8      2        7  5.249639
## 9      1        8  2.762968
## 10     2        9  1.292616
## # ... with 11 more rows

Aggregate groups to satisfy assumptions

Looking at the expected frequencies, we notice that many of the frequencies are less than one, and some of them are less than five. Following the example given here, we will group 0 and 1 extinctions together, and 8 or more extinctions together. We will use the mutate command and utilize the ifelse function to create our aggregated groups. Then, we sum the probabilities within those aggregated groups.

extinctTable %>% 
  mutate(groups = ifelse(nExtinct <= 1, "0 or 1",
                         ifelse(nExtinct >= 8, "8 or more", 
                                nExtinct))) %>%
  group_by(groups) %>%
  summarize(obs = sum(obs),
            exp = sum(exp)) ->
  extinctTable
glimpse(extinctTable)

## Observations: 8
## Variables: 3
## $ groups <chr> "0 or 1", "2", "3", "4", "5", "6", "7", "8 or more"
## $ obs    <dbl> 13, 15, 16, 7, 10, 4, 2, 9
## $ exp    <dbl> 5.876068, 9.996501, 14.030177, 14.768608, 12.436722, 8....

Add in lost probability

The Poisson distribution goes to infinity (and beyond!), which we’ve effectively ignored as our data only went to 20. So, we need to add this probability back in. We do that now by adding the missing frequencies to the last group.

extinctTable[8, "exp"] <- extinctTable[8, "exp"] + (summaryTable$total - sum(extinctTable$exp))
glimpse(extinctTable)

## Observations: 8
## Variables: 3
## $ groups <chr> "0 or 1", "2", "3", "4", "5", "6", "7", "8 or more"
## $ obs    <dbl> 13, 15, 16, 7, 10, 4, 2, 9
## $ exp    <dbl> 5.876068, 9.996501, 14.030177, 14.768608, 12.436722, 8....

Perform chi-squared goodness-of-fit test

Finally, we perform the \(\chi^2\) goodness-of-fit test.

saveChiTest <- chisq.test(extinctTable$obs, p = extinctTable$exp/summaryTable$total)

## Warning in chisq.test(extinctTable$obs, p = extinctTable$exp/summaryTable
## $total): Chi-squared approximation may be incorrect

Note that the chisq.test function assumes the degrees of freedom are number of categories minus one, but we lost one more degree of freedom since we had to estimate the mean of the Poisson distribution from the data. Since our number of categories is 8, we have a degree of freedom given by \(df = 8 - 1 - 1 = 6\). Thus, we will calculate the \(P\)-value manually using pchisq, but use the already calculated test statistic from the chisq.test function (which is why we saved the result in saveChiTest).

pValue <- pchisq(q = saveChiTest$statistic, 
                 df = 6,
                 lower.tail = FALSE)
unname(pValue)

## [1] 0.0005334919

So what’s our conclusion? We reject the null hypothesis that extinctions in the fossil record follow a Poisson distribution.

Mass extinctions with dplyr and tidyr

M. Drew LaMar

March 1, 2017