1 Background

The reasoning behind this example is to show how mathematical and statistical models work. This example will be based on a typical GCSE maths question on probability but it will show how models depend on assumptions and how different ways of building models can give different perspectives.

This will show how from a simple model we can build a more complex version by asking questions of the model and by reducing the number of assumptions. It will also show you how simulations can give unexpected results.

1.1 The initial question

A typical question from a GCSE maths exam testing how to use proportions is as follows:

Abbie wants to find out the number of snails in the garden. One day she catches 30 snails, puts a mark on all of their shells and releases them again. The next day Abbie catches 26 snails. 10 have a mark on their shell. Estimate the total number of snails in the garden.

What assumptions have been made?

The first practical assumption is that the mark did not wash off.

The second is that from day 1 to day 2 the population remains the same. But now it is split into two sub-populations, one marked and one unmarked.

The mathematical assumption is that the probability of catching a marked snails on the second day is the same as the probability of capturing an unmarked snail. We can therefore work out the number of unmarked snails in the garden.

Writing this mathematically we assume:

\[ \frac{10}{30}=\frac{16}{x} \] The 16 in the numerator is the number of unmarked snails collected the second day. This is the total for the second day minus the number of marked snails.

You can rearrange this to show that \(x=48\)

From this we can work out that there are 30 + 48 = 78 snails in the garden in total.

Alternatively we could have created a formula to get the total straight away by saying that the probability for getting a marked snail is the same as the probability of getting any snail.

\[ \frac{10}{30}=\frac{26}{x} \] \[ x = 78\]

Now thinking a bit more about those assumptions. Are they true? Are there any issues with the model?

1.2 Pulling the model apart

Using critical thinking you might be able to see that there are some issues.

From the second sample of snails we have 26 of which 10 were marked previously. The argument is that the fraction of snails caught each day will be the same. For this to be true the probability of capturing a marked snail and an unmarked snail must also be the same.

However if the probability of catching the snails is the same and the population in the garden is the same for both days and for marked and unmarked snails then you should catch the same number of snails everyday. Which from the question we clearly don’t.

The question therefore also assumes something that you did not think of originally and that is that the probability of catching snails varies from day to day even if the probability is the same for marked and unmarked snails. Does this make sense?

We can say that this model is not always consistent. Can we make it consistent?

1.3 A consistent model

Imagine that we know that there are 100 snails in the garden and that the probability of capturing a snail is 0.2 for both marked and unmarked snails. Now we can make a consistent model.

On the first day we capture \(100 \times 0.2\) snails which is 20.

On the second day we have 80 unmarked snails and 20 marked snails from these we capture \(80 \times 0.2\) unmarked snails, which is 16, and \(20 \times 0.2\) marked snails which is 4 making a total of 20 again.

Now all of the assumptions are true and we have consistency.

This is a fully analytical model with two parameters. The population of the snails and the probability of capturing a snail. It has one unique solution and the samples are completely defined.

However the model depends on probability and so we can create a probability model which will result in a simulation.

1.4 The probability model

The snails in the garden can either be captured or not captured. They have two states and so this is an example of a binomial probability. This is analagous to flipping a biased coin 100 times and counting the number of tails.

I can make a simple simulation of this in R where I create a vector (just a set of things) called snails and I put two elements into it called Caught and Escaped. I them use the sample function to create a sample from this with replacement and with a probability of 0.2 for being caught and 0.8 for escaping.

snails <- c("Caught","Escaped")
sample(snails, replace=TRUE, size=100, prob=c(0.2,0.8))
##   [1] "Escaped" "Escaped" "Escaped" "Escaped" "Caught"  "Escaped" "Escaped"
##   [8] "Escaped" "Escaped" "Escaped" "Escaped" "Caught"  "Caught"  "Escaped"
##  [15] "Caught"  "Escaped" "Escaped" "Escaped" "Escaped" "Caught"  "Escaped"
##  [22] "Escaped" "Escaped" "Escaped" "Escaped" "Caught"  "Escaped" "Escaped"
##  [29] "Caught"  "Escaped" "Escaped" "Escaped" "Caught"  "Escaped" "Caught" 
##  [36] "Escaped" "Caught"  "Caught"  "Caught"  "Escaped" "Escaped" "Escaped"
##  [43] "Caught"  "Escaped" "Escaped" "Caught"  "Escaped" "Escaped" "Caught" 
##  [50] "Escaped" "Escaped" "Escaped" "Escaped" "Escaped" "Escaped" "Escaped"
##  [57] "Caught"  "Escaped" "Caught"  "Caught"  "Escaped" "Escaped" "Escaped"
##  [64] "Escaped" "Escaped" "Escaped" "Escaped" "Escaped" "Caught"  "Caught" 
##  [71] "Escaped" "Escaped" "Caught"  "Caught"  "Escaped" "Escaped" "Escaped"
##  [78] "Escaped" "Caught"  "Escaped" "Caught"  "Escaped" "Escaped" "Escaped"
##  [85] "Caught"  "Escaped" "Escaped" "Caught"  "Caught"  "Escaped" "Escaped"
##  [92] "Escaped" "Escaped" "Escaped" "Caught"  "Caught"  "Escaped" "Escaped"
##  [99] "Escaped" "Escaped"

1.5 The Binomial simulation

Simulating the output for all 100 snails is a rather clunky simulation as we just need the total number caught. That is the number of successful trials from 100 with a probability of success of 0.2. We can do that with the following code. Which creates a single random number from the binomial distribution from 100 trials with a probability of success of 0.2.

caught <- rbinom(1,100, 0.2)
caught
## [1] 17

We could run this simulation 10, 100 or even 1000 times. To check how the number of snails caught varies.

caught <- rbinom(10,100, 0.2)
caught
##  [1] 20 19 20 18 23 17 17 19 16 16

We can plot a histogram of the frequency distribution of the number of snails caught.

caught <- rbinom(1000,100, 0.2)
hist (caught, main="Number of Snails Caught", xlab="Number of snails")

This is good as on average there are about 20 caught which is what we expect but because it is a probabilistic model there is variation and there can be less than 10 or more than 30 caught.

1.6 Making a Binomial model of capture and recapture

To make the capture and recapture simulation we need to combine 3 binomial processes. One for the first day. One for the second day for the unmarked snails and one for the second day for the marked snails. The total number of snails will be fixed at 100 and the probability at 0.2.

marked <- rbinom(1,100, 0.2)
unmarked <- 100 - marked
caught <- rbinom(1,unmarked,0.2)
recaptured <- rbinom(1,marked,0.2)
total_day2 <- recaptured+caught
marked
## [1] 20
caught
## [1] 16
recaptured
## [1] 1
total_day2
## [1] 17

We can check this model by doing the calculation as before in the original question except that we know that the answer should be 100.

Substituting these results into the equation that we used before and assuming the same proportions:

\[ \frac{20}{x}=\frac{1}{17} \]

x <- (marked * total_day2)/recaptured 
x
## [1] 340

This shows that for a single simulation because of the probabilistic nature you get variation and you also sometimes get non-integer results which are clearly impossible.

What you can do is as before simulation this for 10, 100 or even 1000 simulations and calculate the values of x for all of them.

marked <- rbinom(1000,100, 0.2)
unmarked <- 100 - marked
caught <- rbinom(1000,unmarked,0.2)
recaptured <- rbinom(1000,marked,0.2)
total_day2 <- recaptured+caught
x <- (marked * total_day2)/recaptured 
hist(x, main="The estimate of the snail population", xlab="Number of snails")

The good news is that the simulations do cluster around a population estimate of 100. The bad news is that there is a very long tail that can sometimes generate populations of 400 or 500 which are wrong by a significant factor.

The simulations have shown that in reality collecting samples of snails and estimating the population from them might give you very misleading results.

1.7 More complex models

These models do not take into account factors such as predation. For example marking the snails might make them easier to find by predators and so result in a larger portion of the marked snails being eaten between samples. Marking could also make the marked snails easier to recapture resulting in different probabilities for the marked and unmarked snails. If the time period was longer than a day then there would be natural births and deaths. The added levels of complexity are almost endless and the real models used for capture and recapture studies are much more mathematically advanced.

1.8 Questions that we could ask of the model

  1. Does changing the probability change the shape shape of the final distribution of snail population estimates?
  2. How does changing the snail population size change the shape of the final distribution of snail population estimates?
  3. What happens if the probabilities change between day 1 and day 2?
  4. What happens if the populations change between day 1 and day 2?