Statistics 208, Karen LaRocque, Problem Set #2

Problem 6.9

Generate 100 bootstrap replicates of the correlation coefficient for the law school data. From these, compute the bootstrap estimate of standard error for the correlation coefficient. Compare your results to those in Table 6.1 and Figure 6.2.

Set seed for consistency

set.seed(127)

Get law school data.

library(bootstrap)
data(law)

Generate 100 bootstrap replicates and plot a histogram.

n <- nrow(law)  #n=15
B <- 100
rep.boot <- numeric(B)
for (i in 1:B) {
    idx <- sample(1:n, size = n, replace = TRUE)  #sample indices
    d.boot <- law[idx, ]  #sample pairs from data
    rep.boot[i] <- cor(d.boot)[1, 2]  #compute & store correlation
}
hist(rep.boot, main = "100 Bootstrap Replications of Correlation Coefficient")

[Figure: histogram of 100 bootstrap replications of the correlation coefficient]

Compute the bootstrap estimate of standard error.

se <- sd(rep.boot)
cat("Bootstrap estimate of standard error is", se)
## Bootstrap estimate of standard error is 0.1233
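
For comparison, the same standard error can be computed with the boot package. This is just an aside (neither the boot package nor the cor_stat helper below is part of the assignment's code); with the same number of replicates it should give a similar value.

library(boot)
cor_stat <- function(d, i) cor(d[i, 1], d[i, 2])  #statistic receives the data and the resampled row indices
law.boot <- boot(law, cor_stat, R = 100)  #100 bootstrap replicates of the correlation
sd(law.boot$t)  #bootstrap estimate of standard error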

Compare your results with the information in Table 6.1 & Figure 6.2.

The bootstrap estimate of the standard error is 0.1233. Although this is based on only 100 bootstrap samples, it is close to the value in Table 6.1 after 3,200 bootstrap samples (0.132). Moreover, much like the histogram in Figure 6.2, the distribution of bootstrap replications of the correlation coefficient is noticeably negatively skewed.

Problem 6.10

Consider an artificial data set consisting of the 8 numbers:

1, 2, 3.5, 4, 7, 7.3, 8.6, 12.4, 13.8, 18.1.

Let \( \boldsymbol{\hat{\theta}} \) be the 25% trimmed mean, computed by deleting the smallest two numbers and largest two numbers, and then taking the average of the remaining four numbers.

Note: There are 10 numbers here, not 8 as stated in the problem. I am therefore computing \( \boldsymbol{\hat{\theta}} \) by deleting the smallest two numbers and the largest two numbers and taking the average of the remaining six numbers.

(a) Calculate \( \boldsymbol{\hat{se}_{B}} \) for B = 25, 100, 200, 500, 1000, 2000. From these results estimate the ideal bootstrap estimate \( \boldsymbol{\hat{se}_{\infty}} \).

Type in the data.

dat <- c(1, 2, 3.5, 4, 7, 7.3, 8.6, 12.4, 13.8, 18.1)
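
As a quick check on the definition above, the trimmed mean of the original sample itself (a value not computed elsewhere in this write-up) is:

mean(sort(dat)[3:8])  #25% trimmed mean of the data: (3.5 + 4 + 7 + 7.3 + 8.6 + 12.4)/6 = 7.133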

Generate bootstrap replications.

n <- length(dat)  #n=10
B <- 2000
boot <- numeric(B)
for (i in 1:B) {
    idx <- sample(1:n, size = n, replace = TRUE)  #choose 10 indices with replacement
    boot[i] <- mean(sort(dat[idx])[3:8])  #draw sample, compute & store trimmed mean
}
hist(boot, main = "2000 Bootstrap Replications of Trimmed Mean")

[Figure: histogram of 2000 bootstrap replications of the trimmed mean]

Compute the bootstrap estimate of standard error for each B.

Also, store coefficient of variation for later.

B_list <- c(25, 100, 200, 500, 1000, 2000)
se_list <- rep(NA, length(B_list))
cv_list <- rep(NA, length(B_list))
for (i in 1:length(B_list)) {
    se_list[i] <- sd(boot[1:B_list[i]])
    cv_list[i] <- sd(boot[1:B_list[i]])/mean(boot[1:B_list[i]])
}
names(se_list) <- paste(B_list, "samples")
print(paste("Standard error with", B_list, "samples:", se_list))
## [1] "Standard error with 25 samples: 2.46104370361664"  
## [2] "Standard error with 100 samples: 2.03137187366404" 
## [3] "Standard error with 200 samples: 2.04383001818059" 
## [4] "Standard error with 500 samples: 2.14038028317969" 
## [5] "Standard error with 1000 samples: 2.08123361916538"
## [6] "Standard error with 2000 samples: 2.06044423296604"

To estimate the ideal bootstrap standard error we need \( E(\Delta) \), which we can estimate by the kurtosis of the empirical distribution (footnote, p. 52).

We can then use equation 6.9, choosing \( cv(\hat{se}_{\infty}) \) to minimize the squared error between the observed and predicted \( cv(\hat{se}_{B}) \) (or using any other optimization method). Multiplying \( cv(\hat{se}_{\infty}) \) by the mean of the bootstrap replications of \( \hat{\theta} \) (used as an estimate of the ideal-bootstrap mean) then gives \( \hat{se}_{\infty} \).
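
For reference, the relationship being fit (equation 6.9, exactly as implemented in fitse_b below) is

\[ cv(\hat{se}_{B}) \approx \left\{ cv(\hat{se}_{\infty})^{2} + \frac{E(\Delta) + 2}{4B} \right\}^{1/2} \]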

library(e1071)
delta <- kurtosis(dat)  # estimate delta
fitse_b <- function(cvinf, cv_list, B_list) {
    fit <- sqrt(cvinf * cvinf + (delta + 2)/(4 * B_list))  # predicted cv(se_B) from equation 6.9
    sqer <- (fit - cv_list)^2  # squared error vs. observed cv
    sum(sqer)  # sum of squared errors
}
cv_ideal <- optimize(fitse_b, interval = c(0, 50), cv_list, B_list)  #find cv of ideal bootstrap that minimizes sum of squares
se_ideal <- cv_ideal$minimum * mean(boot)
cat("Estimate of ideal standard error is", se_ideal)
## Estimate of ideal standard error is 2.079

(b) Repeat part (a) using ten different random number seeds and hence assess the variability in the estimates. How large should we take B to provide satisfactory accuracy?

Repeat part (a) & store results in table.

seeds <- c(abs(ceiling(rnorm(10) * 100)))  # pick ten seeds
print(seeds)  # make sure no repeats
##  [1]  30  73 214 108  63 199  45  31 229  85

n <- length(dat)
B <- 2000
B_list <- c(25, 100, 200, 500, 1000, 2000)
se_list <- matrix(NA, length(seeds), length(B_list))

for (i in 1:length(seeds)) {

    set.seed(seeds[i])  #new seed

    boot <- numeric(B)
    for (k in 1:B) {
        idx <- sample(1:n, size = n, replace = TRUE)  #choose 10 indices with replacement
        boot[k] <- mean(sort(dat[idx])[3:8])  #draw sample, compute & store trimmed mean
    }

    for (j in 1:length(B_list)) {
        se_list[i, j] <- sd(boot[1:B_list[j]])  #calculate se for each B
    }
}

se_frame <- as.data.frame(se_list)
names(se_frame) <- paste(B_list, "samples")
rownames(se_frame) <- paste("seed", 1:10)

Below are the standard errors obtained from 10 seeds x 6 bootstrap sample sizes, along with the standard deviation of the estimates at each B. The seed-to-seed variability drops substantially between 200 and 500 samples and shrinks only slowly after that, so I would use 500 bootstrap samples to get good accuracy.

print(se_frame)
##         25 samples 100 samples 200 samples 500 samples 1000 samples
## seed 1       1.706       1.844       2.000       2.062        2.087
## seed 2       2.546       2.192       2.085       1.967        1.968
## seed 3       1.827       2.063       2.053       2.116        2.063
## seed 4       1.401       1.845       1.876       1.995        2.025
## seed 5       1.766       1.957       2.126       2.098        2.034
## seed 6       1.812       2.098       2.085       2.075        2.122
## seed 7       2.424       2.049       2.244       2.103        2.029
## seed 8       2.207       1.991       1.913       1.969        2.000
## seed 9       2.004       2.031       2.037       1.986        2.061
## seed 10      2.009       2.224       2.059       1.994        1.956
##         2000 samples
## seed 1         2.031
## seed 2         1.985
## seed 3         2.046
## seed 4         2.053
## seed 5         2.011
## seed 6         2.069
## seed 7         2.020
## seed 8         2.017
## seed 9         2.013
## seed 10        1.958

print("Standard deviation of estimated standard error:")
## [1] "Standard deviation of estimated standard error:"
print(apply(se_frame, 2, sd))
##   25 samples  100 samples  200 samples  500 samples 1000 samples 
##      0.34590      0.12707      0.10420      0.05978      0.05148 
## 2000 samples 
##      0.03233

(c) Calculate the ideal bootstrap estimate \( \boldsymbol{\hat{se}_{\infty}} \) directly using formula 6.8. Compare the answer to that obtained in part (a).

Generate indices for all possible samples.

prevsamp <- matrix(1:length(dat), nrow = length(dat))  #start with possible samples for 'slot1'
# fill each subsequent slot with every index >= the index in the previous slot,
# so that reordered duplicates of the same sample are not generated
for (cc in 2:length(dat)) {
    newsamp <- matrix(ncol = cc)
    for (rr in 1:nrow(prevsamp)) {
        # get last idx in sample
        mn <- prevsamp[rr, cc - 1]
        for (k in mn:length(dat)) {
            newsamp <- rbind(newsamp, c(prevsamp[rr, ], k))
        }
    }
    prevsamp <- newsamp[-1, ]
}
idx <- prevsamp

Make sure that we generated the right number of indices.

print(paste("Generated", nrow(idx), "indices for all possible samples. Expecting to generate", 
    choose(19, 10), "indices."))
## [1] "Generated 92378 indices for all possible samples. Expecting to generate 92378 indices."

Draw the bootstrap sample corresponding to each index vector, then compute and store the bootstrap replication of the trimmed mean.

bootall <- rep(NA, nrow(idx))
for (i in 1:nrow(idx)) {
    ind <- idx[i, ]
    bootall[i] <- mean(sort(dat[ind])[3:8])  #draw sample, compute & store trimmed mean
}
hist(bootall, main = "All Possible Bootstrap Replications of Trimmed Mean")

[Figure: histogram of all possible bootstrap replications of the trimmed mean]

Calculate standard error of ideal bootstrap.

se_ideal_emp <- sd(bootall)
print(paste("The direct estimate of se_ideal is", se_ideal_emp, "."))
## [1] "The direct estimate of se_ideal is 2.76022508332454 ."

The direct estimate of \( \hat{se}_{\infty} \) is 2.7602, which is larger than the estimate of 2.0787 obtained in part (a). One likely reason for the gap is that taking an unweighted standard deviation over the 92,378 distinct bootstrap samples treats every distinct sample as equally likely, whereas formula 6.8 weights each sample by its (multinomial) probability of being drawn; the unweighted version overweights rare, extreme samples and so inflates the estimate.
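
A minimal sketch of the probability-weighted calculation (using idx and bootall from the chunks above; this is not part of the original analysis):

probs <- apply(idx, 1, function(ind) {
    counts <- tabulate(ind, nbins = length(dat))  #how many times each observation appears in this sample
    exp(lfactorial(length(dat)) - sum(lfactorial(counts)) - length(dat) * log(length(dat)))  #multinomial probability of the sample
})
m_wtd <- sum(probs * bootall)  #probability-weighted mean of the replications
se_wtd <- sqrt(sum(probs * (bootall - m_wtd)^2))  #probability-weighted standard error, per formula 6.8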

Problem 7.3

Calculate the probability that any particular row of the 88 x 5 data matrix X appears exactly k times in a bootstrap matrix X*, for k = 0, 1, 2, 3.

This can be treated as a binomial problem in which n = 88 rows are drawn and the probability of 'success' on each draw is the probability that our particular row is chosen, p = 1/88. Thus, the probability that the row appears exactly k times in the bootstrap matrix is \( \Pr(k) = {n \choose k} p^{k}(1-p)^{n-k} \). (For k = 0 this gives \( (1 - 1/88)^{88} \approx e^{-1} \); the code below computes k = 1, 2, 3.) These probabilities are printed below.

n <- 88
p <- 1/88
binomprobs <- sapply(1:3, function(k) choose(n, k) * p^k * (1 - p)^(n - k))
names(binomprobs) <- paste0("k=", 1:3)
print(binomprobs)
##     k=1     k=2     k=3 
## 0.36998 0.18499 0.06096

Problem 7.4

A useful approximation for a binomial distribution Bi(n,p) is Po(np). The approximation becomes more accurate as n gets large and p gets small.

The number of times any particular member of x appears in x* is therefore approximately \( Po(np) = Po(88 \times \frac{1}{88}) = Po(1) \) distributed.

Thus, \( Pr(x=k) = \frac{e^{-1} \times 1^{k}}{k!} = \frac{e^{-1}}{k!} \).

These probabilities are printed below.

poisprobs <- sapply(1:3, function(k) exp(-1)/factorial(k))
names(poisprobs) <- paste0("k=", 1:3)
print(poisprobs)
##     k=1     k=2     k=3 
## 0.36788 0.18394 0.06131

The probabilities derived from the binomial distribution and the Poisson approximation are very similar. As shown below, they differ by less than 0.002 in absolute value, or less than 1% in relative terms. This is expected, since p is small (1/88) and n is moderately large (88).

print(poisprobs - binomprobs)
##        k=1        k=2        k=3 
## -0.0021042 -0.0010521  0.0003581
print(poisprobs/binomprobs)
##    k=1    k=2    k=3 
## 0.9943 0.9943 1.0059
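
As a cross-check (an aside, and one that also covers k = 0), the same comparison can be made with R's built-in densities:

rbind(binomial = dbinom(0:3, size = 88, prob = 1/88), poisson = dpois(0:3, lambda = 1))  #exact binomial vs. Poisson approximation for k = 0, ..., 3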

Problem 7.6

Carry out a bootstrap analysis of the principal components based on C and produce plots corresponding to Figures 7.3 and 7.4. Discuss any differences between the two analyses.

Get data.

library(bootstrap)
data(scor)
set.seed(411)

Carry out the bootstrap replications.

B <- 200
n <- nrow(scor)
eigenvals <- matrix(NA, B, 5)
eigenv1 <- matrix(NA, B, 5)
eigenv2 <- matrix(NA, B, 5)
for (b in 1:B) {
    idx <- sample(1:n, n, replace = TRUE)  #sample indices
    d <- scor[idx, ]
    C <- cor(d)  #get C matrix
    eig <- eigen(C)
    eigenvals[b, ] <- eig$values
    eigenv1[b, ] <- eig$vectors[, 1]
    eigenv2[b, ] <- eig$vectors[, 2]
}

First, estimate the standard errors of \( \hat{\theta} \) (the proportion of variance explained by the first principal component, \( \hat{\lambda}_{1}/\sum_{i}\hat{\lambda}_{i} \)) and of the eigenvectors \( \hat{v}_{1} \) and \( \hat{v}_{2} \).

thetas <- apply(eigenvals, 1, function(x) x[1]/sum(x))
se_thetas <- sd(thetas)
cat("The standard error of thetahat is", se_thetas)
## The standard error of thetahat is 0.04211

v1_se <- apply(eigenv1, 2, sd)
v2_se <- apply(eigenv2, 2, sd)

cat("The standard errors for v1 are", v1_se, "and the standard errors for v2 are", 
    v2_se)
## The standard errors for v1 are 0.2505 0.2676 0.31 0.2795 0.266 and the standard errors for v2 are 0.4572 0.3459 0.1033 0.2959 0.3643

Next, look at boxplots for eigenvectors 1 and 2. (Reproduce Fig 7.3).

boxplot(eigenv1, main = "First eigenvector")

[Figure: boxplots of the first eigenvector components]

boxplot(eigenv2, main = "Second eigenvector")

[Figure: boxplots of the second eigenvector components]

Next, look at connected lines for eigenvectors 1 and 2. (Reproduce Fig 7.4).

plot(x = -1, y = -10, xlim = c(0, 6), ylim = c(-1, 1), xlab = "component", ylab = "estimate", 
    main = "First eigenvector")  #dummy plot
matlines(t(matrix(rep(1:5, each = nrow(eigenv1)), nrow(eigenv1), ncol(eigenv1))), 
    t(eigenv1), col = "black")  #lines

[Figure: connected-line plot of first-eigenvector bootstrap replications]


plot(x = -1, y = -10, xlim = c(0, 6), ylim = c(-1, 1), xlab = "component", ylab = "estimate", 
    main = "Second eigenvector")  #dummy plot
matlines(t(matrix(rep(1:5, each = nrow(eigenv2)), nrow(eigenv2), ncol(eigenv2))), 
    t(eigenv2), col = "black")  #lines

[Figure: connected-line plot of second-eigenvector bootstrap replications]

Using the C matrix yields a standard error of \( \hat{\theta} \), and absolute values of the estimated weights in the first and second eigenvectors, very similar to the book's example using the G matrix. Here, however, the signs of the weights flipped more often across replications than in the book's example, so the standard errors of the eigenvector components are larger. (But this is also the case when I use the G matrix, so I don't think it is only a function of using the C matrix.) A likely contributor is that eigen() returns each eigenvector only up to sign, so sign flips across bootstrap replications inflate the component standard errors unless the vectors are first aligned to a common orientation.
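
A minimal sketch of one way to remove this sign ambiguity (align_sign is a hypothetical helper, not part of the original analysis): flip each bootstrap eigenvector so that it points in the same direction as the corresponding eigenvector of the full sample, then recompute the component standard errors.

eig0 <- eigen(cor(scor))  #eigenvectors of the full sample serve as the reference orientation
align_sign <- function(v, ref) if (sum(v * ref) < 0) -v else v  #flip v if it points away from the reference
eigenv1.al <- t(apply(eigenv1, 1, align_sign, ref = eig0$vectors[, 1]))
eigenv2.al <- t(apply(eigenv2, 1, align_sign, ref = eig0$vectors[, 2]))
apply(eigenv1.al, 2, sd)  #component standard errors of the first eigenvector after removing sign flips
apply(eigenv2.al, 2, sd)  #same for the second eigenvector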

Problem 8.7

(a) Generate a second-order autoregressive time series with Gaussian white noise serving as its disturbance term.

The time series is generated and plotted below.
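
The model being simulated (the same recursion implemented by generate_ar2 later in this problem) is

\[ z_{t} = \beta_{1} z_{t-1} + \beta_{2} z_{t-2} + \epsilon_{t}, \qquad \beta_{1} = 0.771, \quad \beta_{2} = -0.222, \quad \epsilon_{t} \sim N(0, 1). \]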

set.seed(986)
n <- 48  #use length from text
betas <- c(0.771, -0.222)  #use betas from text
epsilon <- rnorm(50 + n, 0, 1)  #gaussian white noise
ts <- filter(epsilon, filter = betas, method = "recursive")
ts <- ts[51:length(ts)]  #drop first 50 points
ts <- scale(ts, center = TRUE, scale = FALSE)  #center time series
ts.plot(ts)

[Figure: plot of the simulated AR(2) time series]

(b) Estimate the coefficients of the second-order autoregression fit to the data generated in (a).

The estimated coefficients are printed below.

ft <- ar(ts, order.max = 2, method = "ols")  #fit coefficients
coefest <- ft$ar[, , 1]  #get coefficients
cat("Estimated beta-1 is", coefest[1], "and estimated beta-2 is", coefest[2])  #print coefficients
## Estimated beta-1 is 0.854 and estimated beta-2 is -0.4118

(c) Obtain the bootstrap distribution associated with the fit in (b).

First, define a time-series generation function, adapted from the AR(1) example from class.

generate_ar2 <- function(betas, t1, t2, epsilon) {
    # assumes epsilon starts with (NA, NA, ...); following the book, the first
    # two time points are taken directly from the observed sample
    n <- length(epsilon)
    ts <- rep(0, n)
    ts[1:2] <- c(t1, t2)
    for (t in 3:n) {
        ts[t] <- betas[1] * ts[t - 1] + betas[2] * ts[t - 2] + epsilon[t]
    }
    return(ts)
}

Use the disturbance-term distribution and the estimated coefficients to generate 500 bootstrap replications of the coefficients, and plot their distributions.

ehat <- ft$resid[3:length(ft$resid)]  #no estimated eps for first two tps
B <- 500
boot <- matrix(NA, B, 2)  #2 cols two beta terms

for (b in 1:B) {
    t1 <- ts[1]  #following book, just choosing first tp in sample
    t2 <- ts[2]  #following book, just choosing second tp in sample
    eps <- c(NA, NA, sample(ehat, length(ehat), replace = TRUE))  #sample disturbance terms, append 2 NAs to beginning since we won't use disturbance terms for tp1 and tp2
    t <- generate_ar2(coefest, t1, t2, eps)
    fit <- ar(t, order.max = 2, method = "ols")
    cboot <- fit$ar[, , 1]
    if (length(cboot) < 2) 
        cboot <- c(cboot, NA)  #avoid recycling by accident
    boot[b, ] <- cboot
}

hist(boot[, 1], main = "Histogram of Bootstrap Replications of Coefficient 1")

[Figure: histogram of bootstrap replications of coefficient 1]

hist(boot[, 2], main = "Histogram of Bootstrap Replications of Coefficient 2")

[Figure: histogram of bootstrap replications of coefficient 2]

(d) Compare the standard error for the AR(2) coefficient \( \boldsymbol{\beta_{2}} \) available from R with the standard error available from bootstrapping. Comment on any similarities or discrepancies.

# SE from bootstrap
cat("Bootstrap standard error of first-order coefficient is", sd(boot[, 1], 
    na.rm = TRUE))
## Bootstrap standard error of first-order coefficient is 0.1417
cat("Bootstrap standard error of second-order coefficient is", sd(boot[, 2], 
    na.rm = TRUE))
## Bootstrap standard error of second-order coefficient is 0.1182

# SE from R
cat("R standard error of first-order coefficient is", ft$asy.se.coef$ar[1])
## R standard error of first-order coefficient is 0.1317
cat("R standard error of second-order coefficient is", ft$asy.se.coef$ar[2])
## R standard error of second-order coefficient is 0.1321

The bootstrap standard error for the second-order coefficient is fairly close to R's asymptotic estimate (0.1182 compared to 0.1321), although the bootstrap value is somewhat lower. (For the first-order coefficient the bootstrap value, 0.1417, is slightly higher than R's 0.1317.)

(e) Using your answers to (a) and (b), do an actual sampling experiment, not a bootstrap experiment, to determine the actual standard error of \( \boldsymbol{\beta_{2}} \). Compare the results with the results from (c) and (d).

Create 500 time series with randomly generated disturbance terms and estimate the beta coefficients.

n <- 48  #use length from text
betas <- c(0.771, -0.222)  #use betas from text
sampexp <- matrix(NA, B, 2)  #draw same number of real samples as bootstrap replications

for (s in 1:B) {

    # generate data
    epsilon <- rnorm(50 + n, 0, 1)  #gaussian white noise
    ts <- filter(epsilon, filter = betas, method = "recursive")
    ts <- ts[51:length(ts)]  #drop first 50 points
    ts <- scale(ts, center = TRUE, scale = FALSE)  #center time series

    # fit data
    ft <- ar(ts, order.max = 2, method = "ols")  #fit coefficients
    coefest <- ft$ar[, , 1]  #get coefficients
    if (length(coefest) < 2) 
        coefest <- c(coefest, NA)  #avoid recycling by accident
    sampexp[s, ] <- coefest
}

Plot histograms of the coefficients obtained by drawing 500 samples from the population. The histograms look similar in shape and spread to those obtained from the bootstrap replications. However, the mean of the sampling distribution is closer to the true value (although still not exactly there), whereas the bootstrap replication mean is pulled toward the estimate from our single observed sample; a quick numeric check follows the figures below.

par(mfrow = c(2, 1))
hist(sampexp[, 2], main = "Histogram of Sampling Replications of Coefficient 2", 
    xlim = c(-1, 0.2))
hist(boot[, 2], main = "Histogram of Bootstrap Replications of Coefficient 2", 
    xlim = c(-1, 0.2))

[Figure: histograms of sampling-experiment and bootstrap replications of coefficient 2]

par(mfrow = c(1, 1))
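
The claim about the centers of the two distributions can be checked directly (a quick numeric comparison, not part of the original write-up):

c(true = betas[2], sampling = mean(sampexp[, 2], na.rm = TRUE), bootstrap = mean(boot[, 2], na.rm = TRUE))  #compare the centers of the two distributions with the true second-order coefficient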

The estimated standard errors from the sampling experiment and bootstrap experiment are very similar (0.1098 compared to 0.1182).

# SE from sampling experiment
cat("Sampling experiment standard error of second-order coefficient is", sd(sampexp[, 
    2], na.rm = TRUE))
## Sampling experiment standard error of second-order coefficient is 0.1098
cat("Bootstrap standard error of second-order coefficient is", sd(boot[, 2], 
    na.rm = TRUE))
## Bootstrap standard error of second-order coefficient is 0.1182