Explain what is wrong with each of the following statements.
The standard deviation of the bootstrap distribution will be smaller than the standard deviation of the original sample because the originial sample has oberservations, which have more variance than the sample means of the bootstrap distribution.
The bootstrap distribution is created by resampling \(\textit{with}\) replacement from the original sample.
When generating the resamples, it is best to use a sample size \(\textit{significantly larger}\) than the size of the original sample to model the possible distributions.
The bootstrap distribution is created by resampling with replacement from the population.
Do new “directed reading activities” improve the reading ability of elementary school students, as measured by their Degree of Reading Power” (DRP) scores? A study assigns students at random to either the new method (treatment group, 21 students) or traditional teaching methods (control group, 23 students). To illustrate the process, let’s perform a permutation test by hand for a small random subset of the DRP data. Here is the subset:
Treatment Group 57 61
Control Group 42 62 41 28
Calculate the difference in means \(x_{treatment} − x_{control}\) between the two groups. This is the observed value of the statistic.
\(\overline{x}_{treatment} = 59\)
\(\overline{x}_{control} = 43.25\)
\(\overline{x}_{treatment} - \overline{x}_{control} = 15.75\)
Resample: Start with the six scores and choose an SRS of two scores to form the treatment group for the first resample. (Hint: You can do this using the sample function in R or you could even use a 6-sided die. Using either method, be sure to skip repeated digits.) A permutation resample is an SRS, without replacement. The remaining four scores are the control group. What is the difference in the group means for this sample?
SRS Treatment: 62, 41 Control: 57, 42, 62, 28
\(\overline{x}_{treatment} = 51.5\)
\(\overline{x}_{control} = 47.25\)
\(\overline{x}_{treatment} - \overline{x}_{control} = 4.25\)
Repeat part (b) 20 times to get 20 resamples and 20 values of the statistic. Make a histogram of the distribution of these 20 values. This is the permutation distribution for your resamples.
set.seed(3)
data <- c(62, 41, 57, 42, 62, 28)
distXbar <- c()
nSim = 20
n = length(data)
for(i in 1:nSim){
bootSamp <- sample(1:n, 2, replace = F)
XbarTreat <- mean(data[bootSamp])
XbarCont <- mean(data[-bootSamp])
Xbar = XbarTreat - XbarCont
distXbar <- c(distXbar, Xbar)
}
distXbar
## [1] 4.25 -20.50 0.50 -10.75 4.25 5.00 -5.50 20.00 4.25 5.00
## [11] -20.50 0.50 16.25 16.25 -21.25 5.00 1.25 4.25 4.25 -5.50
hist(distXbar)
What proportion of the 20 statistic values were equal to or greater than the original value in part (a)? You have just estimated the one-sided P-value for these 6 observations.
p <- c()
for(i in 1:nSim){
if(distXbar[i] >= 15.75)
p <- c(p, distXbar[i])
}
length(p)/20
## [1] 0.15
For this small data set, there are only 15 ( 6 choose 2) possible permutations of the data. As a result, we can calculate the exact p-value by counting the number of permutations with a statistic value greater than or equal to the original value and then dividing by 15. What is the exact p-value here? How close was your estimate?
The exact p-value is 0.20, while the estimated value was 0.15.
library(gtools)
index <- combinations(length(data), 2, v = 1:length(data), data, repeats.allowed = F)
## Warning in if (set) {: the condition has length > 1 and only the first element
## will be used
#index
exactDistXbar <- c()
for(i in seq(1, to = length(index)/2, 1)) {
#message("i: ",i)
#bootSamp <- sample(1:n, 2, replace = F)
exactXbarTreat <- mean(data[index[i, ]])
exactXbarCont <- mean(data[-(index[i,])])
#print(data[index[i, ]])
#print(data[-(index[i,])])
exactXbar = exactXbarTreat - exactXbarCont
exactDistXbar <- c(exactDistXbar, exactXbar)
}
#message("Xbar of the 15 Permutations: ")
#exactDistXbar
pCount = 0
for(i in 1:length(exactDistXbar)) {
if (exactDistXbar[i] >= 15.75) {
pCount = pCount + 1
}
}
pValue = pCount/15
pValue
## [1] 0.2
We want to compare the mean diameter at breast height (DBH) for trees from the northern and southern halves of a land tract using a random sample of 30 trees from each region. These data are available on the WISE site, named nspines.csv.
Use a side-by-side boxplot or faceted histograms to examine the data graphically (splitting by region). Does it appear reasonable to use standard t procedures? (Hint: Recall assumptions for t-procedures such as approximate normality and/or a sufficiently large sample size)
There appears to be skew to a low DBH for the north and a skew to hich DBH in the south. Because the data is not normal, using standart t procedures is not recommended.
pines <- read.csv("nspines.csv", head = T)
boxplot(pines$dbh ~ pines$ns, xlab = "Region", ylab = "DBH")
Calculate our observed statistic \(\overline{x}_{North} − \overline{x}_{South}\)
npines <- pines[1:30,]
spines <- pines[31:60,]
nMean <- mean(npines$dbh)
sMean <- mean(spines$dbh)
diffMean <- nMean - sMean
diffMean
## [1] -10.83333
Bootstrap the difference in means \(\overline{x}_{North} − \overline{x}_{South}\) (at least 1000 times) and look at the bootstrap distribution. (Include the histogram).
bootstrapCI2 <- function(data1, data2, nsim) {
n1 <- length(data1)
n2 <- length(data2)
bootCI2 <- c()
for (i in 1:nsim) {
bootSamp1 <- sample(1:n1, n1, replace = TRUE)
bootSamp2 <-sample(1:n2, n2, replace = TRUE)
thisXbar <- mean(data1[bootSamp1])-mean(data2[bootSamp2])
bootCI2 <- c(bootCI2, thisXbar)
}
return(bootCI2)
}
pinesBootCI2 <- bootstrapCI2(npines$dbh, spines$dbh, nsim = 1000)
hist(pinesBootCI2)
There are two ways we discussed calculating a bootstrap confidence interval, the quantile method and the hybrid method (this is also known as the “bootstrap t confidence interval” because it combines the standard error from the bootstrap with the critical value from the t). Calculate both types of confidence intervals.
quantile(pinesBootCI2, c(.025, .975))
## 2.5% 97.5%
## -18.41358 -3.22525
se <- sd(pinesBootCI2)
mean(nMean)-mean(sMean)+c(-1,1)*qt(.975, df=58)*se
## [1] -18.591798 -3.074868
nMean - sMean
## [1] -10.83333
If the bootstrap distribution is approximately Normal and the bias is small, we can use the hybrid method (“bootstrap t confidence interval”); however, it is vulnerable to departures from Normality and large bias. Comment on whether the conditions for the hybrid method (“bootstrap t confidence interval”) are met. Do you believe this interval would be reliable? Note: The bootstrap estimate of the bias is the difference between the mean of the bootstrap distribution and the statistic from the original data. This means that we want the bootstrap distribution to be centered around the observed statistic.
The test statistic of -10.833 is close to the bootstrap mean of -10.820, meaning that there is a low bias. We can use the hybrid method because the bootstrap distribution is normal and has symmetry.
mean(pinesBootCI2)
## [1] -10.94349
qqnorm(pinesBootCI2)
Compare the bootstrap results with the usual two-sample t confidence interval. How do the intervals differ? Which would you use?
t.test(npines$dbh, spines$dbh)
##
## Welch Two Sample t-test
##
## data: npines$dbh and spines$dbh
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333