#Explain what is wrong with each of the following statements
#2a) The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample.
#False. The bootstrap distribution is a distribution of resample means, not of individual observations. A sample contains individual values that can range quite a bit, while the resample means vary much less than those values. The standard deviation of the bootstrap distribution is therefore smaller than the standard deviation of the sample, roughly s/sqrt(n).
#2b) The bootstrap distribution is created by resampling without replacement from the original sample.
#False, a bootstrapped distribution is created by resampling with replacement.
#2c) When generating the resamples, it is best to use a sample size smaller than the size of the original sample.
#False, the resamples should be the same size as the original sample, so that each resample mean is based on the same number of observations as the original sample mean.
#2d) The bootstrap distribution is created by resampling with replacement from the population.
#False, the bootstrap distribution is created by resampling with replacement from the sample.
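#Illustration for 2a-2d (a minimal sketch; demoSample is simulated data used only for illustration and is not part of the assignment).
#It resamples with replacement, from the sample, at the original sample size, and compares the spread of the resample means with the spread of the sample.
set.seed(1)
demoSample <- rexp(50, rate = 1/100) #hypothetical right-skewed sample, n = 50
demoBoot <- replicate(2000, mean(sample(demoSample, length(demoSample), replace = TRUE)))
sd(demoSample) #spread of the individual observations
sd(demoBoot) #spread of the resample means, much smaller
sd(demoSample)/sqrt(length(demoSample)) #roughly matches sd(demoBoot)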
#3a)difference in means
treatment <- c(57,61)
meanTreatment <- mean(treatment)
control <- c(42, 62, 41, 28)
meanControl <- mean(control)
#difference
meanTreatment-meanControl
## [1] 15.75
#3b) one random permutation of the six values into groups of 2 and 4
allData <- c(treatment,control)
sample(allData, 2)
## [1] 62 42
sample_treatment <- mean(c(57,42))
sample_control <- mean(c(61, 62, 41, 28))
sample_treatment-sample_control
## [1] 1.5
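#A minimal sketch of one permutation that keeps the groups tied to the random draw: sample two positions, then split the pooled data by those positions.
#The names pooled, onePermIdx, permTreatment and permControl are illustrative; the result changes from run to run unless a seed is set.
pooled <- c(57, 61, 42, 62, 41, 28)
onePermIdx <- sample(seq_along(pooled), 2) #positions assigned to the treatment group
permTreatment <- pooled[onePermIdx]
permControl <- pooled[-onePermIdx] #the remaining four values form the control group
mean(permTreatment) - mean(permControl)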
#3c) repeat simulation 20 times and create a permutation distribution
set.seed(1234)
n1 <- length(treatment)
n2 <- length(control)
allData <- c(treatment,control)
nsim <- 20
permNull <- c()
for(i in 1:nsim){
  #randomly choose n1 positions for the treatment group; the rest form the control group
  smallsample <- sample(1:(n1+n2), n1, replace = FALSE)
  thisXbar <- mean(allData[smallsample])-mean(allData[-smallsample])
  permNull <- c(permNull,thisXbar)
}
hist(permNull)
#3d) what proportion of the 20 values were equal to or greater than the original value in part a?
permNull
## [1] 19.50 -21.00 16.50 4.50 -6.00 -5.25 -5.25 4.50 5.25 4.50
## [11] 3.75 4.50 -5.25 1.50 5.25 4.50 3.75 -9.00 -20.25 -9.00
#In my data, there were 2 values out of 20 that had a difference at or higher than 15.75. This would result in a p-value of .1.
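#The count can be checked in code (a sketch; obsDiff and permValues are illustrative names, with permValues copied from the printed output).
obsDiff <- 15.75
permValues <- c(19.50, -21.00, 16.50, 4.50, -6.00, -5.25, -5.25, 4.50, 5.25, 4.50,
                3.75, 4.50, -5.25, 1.50, 5.25, 4.50, 3.75, -9.00, -20.25, -9.00)
sum(permValues >= obsDiff) #number of simulated differences at or above the observed one
mean(permValues >= obsDiff) #estimated one-sided p-value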
#3e) Exact p value by calculating the number of permutations with a value greater than or equal to the original value and then dividing by 15.
#Out of 15 possible permutations (choose(6,2) = 15), only 3 yield a difference equal to or greater than the original difference of 15.75: the treatment group must contain two of the three largest values (57, 61, 62) for its mean to be that much higher than the control group mean. 3 out of 15 gives an exact p-value of .2. This is in the same range as my estimate of .1, which is based on only 20 random permutations.
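#The exact p-value can be checked by listing every way to choose 2 of the 6 values for the treatment group (a sketch using base R's combn(); pooledAll, trtIdx and exactDiffs are illustrative names).
pooledAll <- c(57, 61, 42, 62, 41, 28)
trtIdx <- combn(length(pooledAll), 2) #one column per possible treatment group
exactDiffs <- apply(trtIdx, 2, function(idx) mean(pooledAll[idx]) - mean(pooledAll[-idx]))
length(exactDiffs) #number of possible assignments
mean(exactDiffs >= 15.75) #exact one-sided p-value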
calls <- read.csv("calls80.csv", header = TRUE)
#4a) Make a histogram of the call lengths. Describe the shape of the distribution.
hist(calls$length)
#The distribution of call lengths is skewed to the right.
#4b) Bootstrap the data using 1000 resamples and inspect the bootstrap distribution of the mean.
bootStrap <- function(data,nsim){
  #returns nsim bootstrap means, each from a resample of size n drawn with replacement
  n <- length(data)
  boot <- numeric(nsim)
  for(i in 1:nsim){
    bootSamp <- sample(1:n, n, replace = TRUE)
    boot[i] <- mean(data[bootSamp])
  }
  return(boot)
}
callBootstrap <- bootStrap(calls$length,nsim = 1000)
hist(callBootstrap)
#4c) The central part of the distribution is close to Normal. In what way do the tails depart from Normality?
qqnorm(callBootstrap)
#If the bootstrap means were perfectly Normal, the points would fall on a straight line in the QQ plot. However, this is not the case. The QQ plot curves upward, indicating a right skew: the upper tail is longer and the lower tail is shorter than in a Normal distribution.
#4d) Create and inspect the bootstrap distribution of the sample mean for these data using 1000 resamples. Compared with your distribution from the previous part, is this distribution closer to or farther away from Normal?
srs_calls <- c(104,102,35,211,56,325,67,9,179,59)
srsBootstrap <- bootStrap(srs_calls,1000)
qqnorm(srsBootstrap)
hist(srsBootstrap)
#The distribution and qq plot of the bootstrap of the SRS are very similar to the previous distribution that bootstrapped all the data.
#4e) Compare the bootstrap standard errors for your two sets of resamples. Why is the standard error larger for the smaller SRS?
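#Note: sd() of the bootstrap means is already the bootstrap standard error of the mean. The division by sqrt(n) here and for se_callBoot below shrinks it further, so multiply the printed values by sqrt(n) to recover it.
se_srs <- sd(srsBootstrap)/sqrt(10)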
se_srs
## [1] 9.349822
se_callBoot <- sd(callBootstrap)/sqrt(80)
se_callBoot
## [1] 4.218865
#The standard error is larger for the SRS bootstrap because the SRS is much smaller (n = 10 versus n = 80). Means of small samples vary more from resample to resample, so a resample mean typically falls farther from the center of the bootstrap distribution. Mathematically, the standard error of a mean is roughly s/sqrt(n), and for samples with similar spread, dividing by the smaller sqrt(10) instead of sqrt(80) gives a larger quotient.
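#A sketch comparing the bootstrap standard error with the formula s/sqrt(n) for the SRS of 10 calls. Here the standard deviation of the bootstrap means is used directly as the standard error; srsDemo and srsBootMeans are illustrative names and the seed is arbitrary.
set.seed(2024)
srsDemo <- c(104,102,35,211,56,325,67,9,179,59)
srsBootMeans <- replicate(10000, mean(sample(srsDemo, length(srsDemo), replace = TRUE)))
sd(srsBootMeans) #bootstrap standard error of the mean
sd(srsDemo)/sqrt(length(srsDemo)) #formula-based standard error, for comparison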
pines <- read.csv("nspines.csv",TRUE)
#5a) Use a side by side boxplot to examine the data graphically (splitting by region). Does it appear reasonable to use standard t-procedures?
boxplot(pines$dbh ~ pines$ns, xlab = "Region", ylab = "DBH")
#No, it does not appear reasonable to use standard t procedures because the data are not close to Normal. Neither region is symmetric: the DBH of the north region is skewed to the right, and the DBH of the south region is skewed to the left.
#5b) Calculate the observed statistic (XbarN-XbarS)
region <- as.factor(pines$ns)
north <- pines[c(1:30),2]
north
## [1] 27.8 14.5 39.1 3.2 58.8 55.5 25.0 5.4 19.0 30.6 15.1 3.6 28.4 15.0
## [15] 2.2 14.2 44.2 25.7 11.2 46.8 36.9 54.1 10.2 2.5 13.8 43.5 13.8 39.7
## [29] 6.4 4.8
south<- pines[c(31:60),2]
south
## [1] 44.4 26.1 50.4 23.3 39.5 51.0 48.1 47.2 40.3 37.4 36.8 21.7 35.7 32.0
## [15] 40.4 12.8 5.6 44.3 52.9 38.0 2.6 44.6 45.5 29.1 18.7 7.0 43.8 28.3
## [29] 36.9 51.6
mean(north)
## [1] 23.7
mean(south)
## [1] 34.53333
mean(north)-mean(south)
## [1] -10.83333
#5c) Bootstrap the difference in means at least 1000 times and look at the bootstrap distribution. Include the histogram.
bootStrapCI2<-function(data1, data2, nsim){
  #returns nsim bootstrap differences in means; each group is resampled separately, with replacement, at its own size
  n1<-length(data1)
  n2<-length(data2)
  bootCI2<-numeric(nsim)
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    bootCI2[i]<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
  }
  return(bootCI2)
}
pinesBootstrap <- bootStrapCI2(north,south,10000)
hist(pinesBootstrap)
#5d) Calculate a confidence interval using the hybrid method and using the quantile method.
#Quantile method
quantile(pinesBootstrap,c(.025, .975))
## 2.5% 97.5%
## -18.71700 -2.66975
#Hybrid method
se <- sd(pinesBootstrap)
(mean(north)-mean(south))+c(-1,1)*qt(.975,df=58)*se
## [1] -19.02189 -2.64478
#5e) Are the conditions of the hybrid method met?
#observed mean
mean(north)-mean(south)
## [1] -10.83333
#bootstrap mean
mean(pinesBootstrap)
## [1] -10.83151
#normality
qqnorm(pinesBootstrap)
#There appears to be little bias in the bootstrap: the bootstrap mean (-10.83151) differs from the observed statistic (-10.83333) by only about 0.002, so the bootstrap distribution is centered very close to the observed difference. The QQ plot also indicates that the bootstrap distribution is approximately Normal. With small bias and approximate Normality, the conditions for the hybrid method are met and the interval should be reliable.
#5f) Compare the bootstrap results with the usual two sample t confidence interval. How do the intervals differ? Which would you use?
t.test(north, south)
##
## Welch Two Sample t-test
##
## data: north and south
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333
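#The usual two-sample t confidence interval is slightly wider (about 16.5) than the bootstrap intervals (about 16.0 for the quantile method and 16.4 for the hybrid method). I would use the bootstrap interval because it is slightly narrower and does not rely on the Normality of the data, which the boxplots in 5a call into question.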
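#The interval widths can be compared directly from the printed endpoints (a sketch; ciEndpoints is an illustrative name and the values are copied from the output above).
ciEndpoints <- rbind(quantile = c(-18.71700, -2.66975),
                     hybrid = c(-19.02189, -2.64478),
                     welch = c(-19.090199, -2.576468))
ciWidths <- ciEndpoints[, 2] - ciEndpoints[, 1] #upper endpoint minus lower endpoint
ciWidths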