Licensed under the Creative Commons Attribution-NonCommercial license. You may share and adapt this material, provided you cite the source. http://creativecommons.org/licenses/by-nc/3.0/

Bootstrapping cases when there is no theory

Suppose that we are interested in the inequality of the number of insects found in monitoring traps in crop fields. The inequality can be measured with the Gini coefficient. This coefficient is zero when all values are the same and approaches 1.0 as the distribution tends toward one in which all traps but one have zero insects and a single trap has a large number of insects. The information we have is the number of insects in each trap in a simple random sample of 20 fields.
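As a quick illustration of these two extremes, here is a minimal sketch using the Gini function from the DescTools package (the same function used later for sample estimates); the counts are invented only for this illustration:

library(DescTools)

Gini(rep(5, 20))            # every trap has 5 insects: perfect equality, Gini = 0
Gini(c(rep(0, 19), 1000))   # all insects in a single trap: Gini at its maximum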

If we could assume that the number of insects has a known distribution \(f(y)\), then the population Gini coefficient could be calculated using the equation that defines it:


\[G = \frac{1}{2\mu} \ \sum^n_{i = 1} \sum^n_{j = 1}f(y_i) \ f(y_j) \ |y_i - y_j| \\[25pt] \text{where} \ f(y_i) \ \text{is the probability mass function of Y, and} \\[25pt] \mu = \sum^n_{i = 1} \ y_i \ f(y_i)\]
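As a quick check of the definition, consider a toy variable that takes the values 0 and 10 with probability 0.5 each. Then \(\mu = 5\) and the double sum gives \(G = (2 \times 0.5 \times 0.5 \times 10) / (2 \times 5) = 0.5\). A direct translation into R (a small sketch, separate from the insect example):

# Toy check of the definition: Y takes the values 0 and 10 with probability 0.5 each
y <- c(0, 10)
f <- c(0.5, 0.5)
mu <- sum(y * f)
(G <- sum(outer(f, f) * abs(outer(y, y, "-"))) / (2 * mu))  # 0.5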


The Poisson distribution would be suitable to model \(f(y)\) if insects were trapped independently of each other. This is probably not a good assumption, because insects are usually clustered in space and time, which means that the probability that another insect is trapped increases if one is already trapped. The negative binomial distribution would be more appropriate for discrete events that are “contagious.” However, the actual distribution at the time of sampling could be anything, and it probably is not exactly any of the distributions that have been characterized and described.
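The difference is easy to see by simulation: for a Poisson count the variance equals the mean, while a negative binomial count with the same mean is overdispersed. The parameter values below are arbitrary and only meant to illustrate the contrast.

# Overdispersion check: Poisson variance ~ mean; negative binomial variance
# ~ mu + mu^2/size, which is larger (consistent with clustered counts)
set.seed(123)
pois.counts <- rpois(1e5, lambda = 6)
nb.counts <- rnbinom(1e5, size = 3, mu = 6)

c(mean = mean(pois.counts), var = var(pois.counts))
c(mean = mean(nb.counts), var = var(nb.counts))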

For the purpose of this example, we will assume that the number of insects per trap has a nonstandard distribution that is known to us but not to those who need to sample it and estimate the Gini coefficient.

The example has the following steps:

  1. Set up the nonstandard distribution.
  2. Use simulation to approximate the sampling distribution of the Gini estimator nonparametrically.
  3. Take a random sample from the distribution and estimate the Gini coefficient and its CI using bootstrapping. Compare to the theoretical distribution.
  4. Repeat the process over many simulated samples from the known distribution and evaluate the expected value of the bootstrapped estimate and the coverage of the confidence intervals.

Nonstandard population distribution

The simulated distribution is defined by a table of 100 values giving the probabilities of having 0 to 99 insects in a trap. The population of fields is assumed to be very large or infinite, so no finite population correction is needed for the estimates.

For simplicity, the distribution is created as a weighted mixture of negative binomial distributions truncated at 99, plus a small amount of Poisson-generated noise.

# Simulation of a distribution that could be consistent with number of insects per trap.

# Three negative binomial components, truncated at 99 insects per trap
one <- dnbinom(0:99, size = 3, mu = 2)
two <- dnbinom(0:99, size = 4, mu = 6)
three <- dnbinom(0:99, size = 3, mu = 15)

# Mixture weights (named w to avoid masking the base function c)
w <- c(0.15, 0.00, 0.85, 0.8)

# Poisson-based noise component
set.seed(1)
add <- rpois(100, 5) / 1500

# Weighted mixture, normalized so that the probabilities sum to 1
pmf <- (w[1] * one + w[2] * two + w[3] * three + w[4] * add) / 
  sum(w[1] * one + w[2] * two + w[3] * three + w[4] * add)

plot(pmf ~ c(0:99), type = "h", xlab = "y")

sum(0:99*pmf) # calculate mu
## [1] 20.73983

In order to calculate the Gini coefficient for the population we apply the definition above. For sample estimates, we will use the estimator implemented in the Gini function of the DescTools package.

library(DescTools) # To estimate Gini coefficient

gini.pop.fun <- function(x, pmassf){
  fifj.mat <- pmassf %*% t(pmassf)  # all products f(y_i) * f(y_j)
  yi.mat <- matrix(rep(x, length(x)), nrow = length(x))
  yj.mat <- matrix(rep(x, length(x)), nrow = length(x), byrow = TRUE)
  abs.yiyj.mat <- abs(yi.mat - yj.mat)  # all absolute differences |y_i - y_j|
  sum(fifj.mat * abs.yiyj.mat) / (2 * sum(x * pmassf))  # double sum divided by 2 * mu
}

gini.pop.fun(0:99, pmf) # Population Gini coefficient
## [1] 0.5134417

Now we define a function that draws random samples from the distribution created above. We use it to generate an approximate sampling distribution of the functional of interest (the Gini coefficient). Because the distribution is nonstandard, there is no theory or closed-form expression for the sampling distribution of the Gini estimator, so we approximate it by repeated sampling from the true distribution.

# Draw a simple random sample of a given size from the population pmf
i.smpl <- function(size){
  sample(x = 0:99, size = size, replace = TRUE, prob = pmf)
}


# Get 40000 samples from the true distribution and calculate
# 40000 values for the Gini coefficient.
# Plot approximate sampling distribution.

gini.dist <- sapply(1:40000, FUN = function(i){
  s <- i.smpl(20)
  Gini(s) # Use bias-corrected estimator
})

str(gini.dist) # Check results are as expected
##  num [1:40000] 0.536 0.639 0.404 0.398 0.547 ...
# Plot sampling distribution
hist(gini.dist, breaks = 175, freq = FALSE, xlab = "Sample Gini coefficient")
lines(density(gini.dist, adjust = 0.25))
abline(v = mean(gini.dist), lwd = 1.5, col = "red")
(popCI <- quantile(gini.dist, c(0.025, 0.975)))
##      2.5%     97.5% 
## 0.3688565 0.6297641
abline(v = popCI, col = "green")

The sampling distribution is unimodal but clearly skewed.
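A quick way to quantify that skewness is the standardized third moment of the simulated values (computed directly here rather than with an additional package):

# Sample skewness (third standardized moment) of the simulated sampling distribution
mean((gini.dist - mean(gini.dist))^3) / sd(gini.dist)^3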

Draw one sample and bootstrap

In this section we simulate one sample from the distribution and apply the bootstrap. We simulate samples of size 20 throughout this document.

set.seed(0923)

(sample1 <- i.smpl(20))
##  [1]  9  2  1  3  0 18 17  2 14  5  6 27  8 20  4 98 13 37 70 17
Gini(sample1) # bias-corrected estimate
## [1] 0.6183856
(g.star0 <- gini.pop.fun(sample1, rep(1/20, 20))) # population formula on sample
## [1] 0.5874663
g.boot.fun <- function(data, i) {
  data <- data[i]
  Gini(data)
}

library(boot)

g.boot <- boot(data = sample1, 
               statistic = g.boot.fun, 
               R = 5000, 
               sim = "ordinary")

boot.ci(g.boot, type = "bca") # One-sample based CI
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = g.boot, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.5039,  0.7530 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
(bias.hat <- mean(g.boot$t) - g.star0)
## [1] -0.005890159
# Compare my bootstrapping with that included in Gini function
Gini(sample1, conf.level = 0.95, R = 5000)
##      gini    lwr.ci    upr.ci 
## 0.6183856 0.5026526 0.7445384
# Try a larger sample to approach true value
sample2 <- i.smpl(100)
# The estimate and CI obtained are 
Gini(sample2, conf.level = 0.95, R = 5000)
##      gini    lwr.ci    upr.ci 
## 0.5118101 0.4546165 0.5737750
# Plot the bootstrapped sampling distribution

hist(g.boot$t[,1], breaks = 175, freq = FALSE, xlab = "Sample Gini coefficient")
lines(density(g.boot$t[,1], adjust = 0.25))
abline(v = mean(g.boot$t[,1]), lwd = 1.5, col = "red")
(bootpCI <- boot.ci(g.boot, type = "bca")$bca[4:5])
## [1] 0.5039045 0.7530204
abline(v = bootpCI, col = "green")

Note that the shape of the bootstrap distribution is very similar to the approximate “true” sampling distribution obtained above by simulation. The bias-corrected, accelerated (BCa) bootstrap CI clearly shows the effect of “mirroring” to account for the skewness of the distribution. The point is not to bracket 95% of the bootstrap-estimated sampling distribution, which is centered at a random point drawn from a similar distribution centered near the true population value. The point is to bracket the unknown population value, about which the bootstrap estimate varies from real sample to real sample.
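The effect of the skewness adjustment can be seen by requesting the plain percentile interval and the BCa interval from the same bootstrap object; the endpoints differ because of the bias and acceleration corrections (a quick check with the objects already created above):

# Compare the plain percentile interval with the BCa interval for the same replicates
boot.ci(g.boot, type = c("perc", "bca"))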

Simulate 200 samples from the true distribution and study bootstrap CI coverage

# Prepare matrix to receive results
g.mat <- matrix( , nrow = 200, ncol = 3)

# Simulate 200 samples and CI's

system.time(
  for (i in 1:200) {
    obs.dat <- i.smpl(20)
    g.boot <- boot(data = obs.dat, 
                   statistic = g.boot.fun,
                   R = 5000,
                   sim = "ordinary")
    g.star0 <- gini.pop.fun(obs.dat, rep(1/20, 20))
    g.ci <- boot.ci(boot.out = g.boot,
                    type = c("bca"))
#    bias <- mean(g.boot$t) - g.star0
    g.hat <- g.boot$t0  # - bias (bias correction commented out)
    g.mat[i, ] <- c(g.ci$bca[4:5], g.hat)
  }
)
##    user  system elapsed 
##  47.394   0.266  47.727
# Plot results

ymin <- 0.2
ymax <- 0.8

g.hat <- g.mat[, 3]
ci.lo <- g.mat[, 1]
ci.hi <- g.mat[, 2]


ci.colors <- c("red", "blue")
ci.color <- ci.colors[as.numeric(ci.lo < gini.pop.fun(0:99, pmf) & 
                                   ci.hi > gini.pop.fun(0:99, pmf)) + 1]

plot(g.hat, xlab = "Sample", ylab = "95% CI", 
     ylim = c(ymin, ymax), col = ci.color, pch = 16)

abline(h = gini.pop.fun(0:99, pmf), col = "grey")

segments(x0 = 1:200, y0 = ci.lo, y1 = ci.hi, col = ci.color)

# Proportion of intervals that contain the population Gini coefficient
coverage <- paste0("Coverage = ",
                   sum(ci.lo < gini.pop.fun(0:99, pmf) &
                         ci.hi > gini.pop.fun(0:99, pmf)) / 200)

text(x = 20, y = 0.25, labels = coverage, cex = 1)

mean(g.mat[, 3])
## [1] 0.5108688

The BCa confidence interval appears to have lower coverage than the nominal level. Several simulations of N = 200 samples each showed a coverage of about 0.90 for the 95% CI. One simulation with samples of size 30 raised the coverage to approximately the nominal level.
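One way to check this is to wrap the simulation above in a function of the sample size and rerun it. The sketch below does that; coverage.sim is a helper name introduced only for this illustration, and R is kept smaller than above to limit run time.

# Hypothetical wrapper around the coverage simulation above: returns the
# proportion of BCa intervals that contain the population Gini coefficient.
coverage.sim <- function(smpl.size, n.samples = 200, R = 2000) {
  g.true <- gini.pop.fun(0:99, pmf)
  covered <- logical(n.samples)
  for (i in 1:n.samples) {
    obs.dat <- i.smpl(smpl.size)
    g.boot <- boot(data = obs.dat, statistic = g.boot.fun, R = R)
    g.ci <- boot.ci(g.boot, type = "bca")$bca[4:5]
    covered[i] <- g.ci[1] < g.true & g.ci[2] > g.true
  }
  mean(covered)
}

# coverage.sim(20)   # about 0.90 in the runs described above
# coverage.sim(30)   # closer to the nominal 0.95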