title: “The Bootstrap”
author: “Rajaram Nityananda”
date: “September 9, 2019”

This Rmd document illustrates the bootstrap using the example of a journalist who interviews an odd number ‘nv’ of voters after they have voted (an exit poll) but before the full count has taken place. Whoever has more votes in the sample is predicted to be the winner. The first code chunk estimates the probability of this prediction going wrong. To generate the samples, we need to specify the sample size and the fraction of supporters of the winner (A) in the full population.

# nv <- as.integer(readline("sample size (please enter an odd integer) = "))
nv <- 19
print("now we are simulating a thousand surveys ")

## [1] "now we are simulating a thousand surveys "

# f <- as.numeric(readline("fraction of A supporters (between 0.5 and 1) = "))
f <- 0.65
# We have now chosen the sample size and the fraction of
# supporters of the winning candidate
count <- 0  # number of surveys that predict the wrong winner
for (n in 1:1000) {
  x <- runif(nv)  # one survey: voter i supports A if x[i] > 1 - f
  # The survey fails if the number of A supporters is less than a majority
  if (sum(x > (1 - f)) < ceiling(nv / 2)) {
    count <- count + 1
  }
}
# Print the fraction of surveys that failed
cat(c("failure probability by simulation of multiple samples = ", as.character(count / 1000)))

## failure probability by simulation of multiple samples =  0.081
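
As a cross-check on the simulation, the failure probability can also be computed exactly, because the number of A supporters in a sample of nv voters follows a binomial distribution with success probability f. The sketch below assumes the same nv and f as above; the names majority and p_fail_exact are introduced here only for illustration. The result should be close to, but not identical with, the simulated fraction.

```r
nv <- 19    # sample size (odd), as in the chunk above
f <- 0.65   # fraction of A supporters in the full population
majority <- ceiling(nv / 2)   # smallest number of A votes that wins the sample
# The prediction fails when fewer than 'majority' sampled voters support A,
# i.e. when the binomial count is at most majority - 1
p_fail_exact <- pbinom(majority - 1, size = nv, prob = f)
cat("exact failure probability =", p_fail_exact, "\n")
```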

This way of estimating the failure probability is equivalent to sending out a thousand journalists and seeing what fraction of them get the wrong answer. The miracle of the bootstrap is that (at a price) we can do this with just ONE sample. The trick is to use this one sample to get more samples by ‘resampling’. For example, if we have interviewed 15 people, of whom 11 support A, we create new samples of 15 by sampling from these 15 with replacement. Clearly, just by chance, some of these new samples will have more than 11 supporters of A, some fewer. This is equivalent to creating a large population by replicating the sample 1000 times and then sampling from that without replacement (whether or not we replace makes practically no difference when the population is much larger than the sample).

This seems like cheating, but it works, and the inventor, Brad Efron, is a highly respected statistician. He also writes for a wider audience; if you are curious, please look at his Scientific American article, though it is not needed for this course.

https://www.vanderbilt.edu > quantitative-content > diaconis_efron_1983
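
The resampling step in the 15-voter example can be sketched as follows. This is only an illustration under stated assumptions: the vector survey (1 for an A supporter, 0 otherwise) and the names a_counts, big_population and one_draw are made up here, and 1000 resamples is an arbitrary choice.

```r
# One hypothetical survey: 15 voters, 11 of whom support A (1 = A, 0 = other)
survey <- c(rep(1, 11), rep(0, 4))
# Draw 1000 resamples of size 15 with replacement; count A supporters in each
a_counts <- replicate(1000, sum(sample(survey, length(survey), replace = TRUE)))
# Spread of the A count across resamples: some above 11, some below
print(table(a_counts))
# Alternative view: replicate the survey 1000 times into a large population
# and draw 15 voters from it without replacement
big_population <- rep(survey, times = 1000)
one_draw <- sum(sample(big_population, 15, replace = FALSE))
cat("A supporters in one draw from the replicated population:", one_draw, "\n")
```

Because no random seed is fixed, the table will differ a little from run to run.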

Here is the bootstrap code, which differs from the first code chunk in only one vital detail. We prepare a single random sample x of size nv outside the loop, and inside the loop we create 1000 fake samples xres by resampling x with replacement and count how many of them go wrong.

# nv <- as.integer(readline("sample size (please enter an odd integer) = "))
nv <- 19
# f <- as.numeric(readline("fraction of A supporters (between 0.5 and 1) = "))
f <- 0.65
print("Now we are doing bootstrap with just one survey")

## [1] "Now we are doing bootstrap with just one survey"

count <- 0  # number of resamples that predict the wrong winner
x <- runif(nv)  # the one real survey: voter i supports A if x[i] > 1 - f
for (n in 1:1000) {
  xres <- sample(x, nv, replace = TRUE)  # resample the survey with replacement
  # The resample fails if the number of A supporters is less than a majority
  if (sum(xres > (1 - f)) < ceiling(nv / 2)) {
    count <- count + 1
  }
}
# Print the fraction of resamples that failed
cat(c("failure probability by bootstrap with one sample = ", as.character(count / 1000)))

## failure probability by bootstrap with one sample =  0.114

You will notice that the bootstrap result doesn't agree exactly with the result of the large simulation, but it is (usually!) comparable, and it saves us the labour of going out and actually collecting a large number of surveys just to know how badly one survey can go wrong. The bootstrap estimate of the failure probability is itself variable: it changes from run to run, because it depends on the one sample x that happened to be drawn as well as on the random resampling. The bootstrap has not replaced mathematical analysis; in fact, it has been the subject of much mathematical analysis.
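
One way to see where this variability comes from: a resample of size nv drawn with replacement from a survey containing k supporters of A has a binomially distributed number of A supporters, with success probability k/nv. The bootstrap estimate therefore depends strongly on which k the single survey happened to produce. The sketch below tabulates this for the same nv and f as above; the data frame result and its column names are chosen here for illustration, and the boot_fail column is the value the resampling loop would approach with a very large number of resamples.

```r
nv <- 19
f <- 0.65
majority <- ceiling(nv / 2)
k <- majority:nv   # surveys in which A is ahead
result <- data.frame(
  k = k,                                        # A supporters in the one survey
  prob_of_k = dbinom(k, size = nv, prob = f),   # chance of drawing such a survey
  boot_fail = pbinom(majority - 1, size = nv, prob = k / nv)  # bootstrap failure probability
)
print(round(result, 3))
```

Surveys in which A is only narrowly ahead give a large bootstrap failure probability, while surveys with a comfortable lead give a small one.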