Problem 2.

A. This is false because the standard deviation of the original sample will be larger than that of the bootstrap distribution. The bootstrap distribution consists of sample means, while the original sample consists of individual observations, which have more variance than sample means (see the sketch after part D).

B. This is false because the bootstrap distribution is created by resampling with replacement from the original sample, not without.

C. This is false because each bootstrap resample should have the same size as the original sample; the spread of the statistic depends on the sample size, so a different resample size would not represent its sampling distribution accurately.

D. This is false because the bootstrap distribution is created by resampling with replacement not from the population, but from the original sample.
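
A minimal sketch (using a made-up vector, not data from this assignment) illustrating parts A and C: the spread of bootstrap sample means is roughly s/sqrt(n), which is smaller than the spread s of the individual observations.

toy_sample <- c(12, 15, 9, 22, 18, 11, 25, 14)   # hypothetical data
boot_means <- replicate(1000, mean(sample(toy_sample, length(toy_sample), replace = TRUE)))
sd(toy_sample)   # standard deviation of the individual observations
sd(boot_means)   # standard deviation of bootstrap means, close to sd(toy_sample)/sqrt(8)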

Problem 3. a.

Treatment <- c(57, 61)
Control <- c(42, 62, 41, 28)
Difference_Means <- mean(Treatment) - mean(Control)
Difference_Means
## [1] 15.75

The difference in means between the two groups is 15.75.

set.seed(123)
orig_sample <- c(Treatment, Control)
sample(orig_sample, 2)
## [1] 42 28
mean_Treatment_sample <- mean(c(42,28))
mean_Control_sample <- mean(c(57,61,62,41))
mean_Treatment_sample - mean_Control_sample
## [1] -20.25

The difference in group means for this sample is -20.25.
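
The same single permutation can be sketched by sampling indices rather than values, so the remaining "control" group is taken automatically instead of being typed back in by hand:

perm_idx <- sample(seq_along(orig_sample), 2)                 # indices of the permuted "treatment" group
mean(orig_sample[perm_idx]) - mean(orig_sample[-perm_idx])    # difference in group means for this permutation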

set.seed(111)
nsim <- 20                 # number of random permutations to generate
n1 <- length(Treatment)
n2 <- length(Control)

permNull <- c()            # will hold the permutation null statistics

for (i in 1:nsim) {
  # randomly reassign n1 of the 6 observations to the "treatment" group
  permSamp <- sample(1:(n1+n2), n1, replace=FALSE)
  # difference in group means under this permutation
  thisXbar <- mean(orig_sample[permSamp]) - mean(orig_sample[-permSamp])
  permNull <- c(permNull, thisXbar)
}

hist(permNull)

permNull
##  [1] -20.25   5.25   1.50 -10.50  19.50   0.75   3.75  19.50   1.50   1.50
## [11]   0.75  -9.00   5.25  -5.25  -9.00   4.50   4.50 -21.00  -5.25  16.50

3 of the 20 permutation statistics were greater than or equal to the original value of 15.75 from part (a), so the estimated p-value is 3/20 = 0.15.

3 of the 15 possible permutations result in a statistic greater than or equal to our original value, so the exact p-value is 3/15 = 0.20.
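
As a sketch of how that exact enumeration could be checked in R (reusing orig_sample and Difference_Means from above), combn() lists all choose(6, 2) = 15 possible treatment assignments:

all_pairs <- combn(length(orig_sample), 2)   # each column is one possible "treatment" pair of indices
exact_null <- apply(all_pairs, 2, function(idx) {
  mean(orig_sample[idx]) - mean(orig_sample[-idx])
})
mean(exact_null >= Difference_Means)   # exact one-sided p-value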

Problem 4. a.

#install.packages("readr")
#library(readr)

nspines <- read.csv("nspines.csv", header = TRUE)
boxplot(nspines$dbh ~ nspines$ns, xlab = "Region", ylab = "DBH")

There appears to be a significant amount of right skew in the northern region sample and a significant amount of left skew in the southern region sample. Whilst a sample size of 30 is not horribly low, it could definitely be larger in order to allow for more certainty and a truer representation of the data. I think using a bootstrap would therefore be useful in this case.
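
As a quick numeric companion to the boxplot (a sketch, reusing the same columns), comparing each region's mean and median hints at the direction of the skew:

tapply(nspines$dbh, nspines$ns, summary)   # mean above the median suggests right skew, below suggests left skew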

nspines_n <- nspines[1:30,]    # northern sample (first 30 rows)
nspines_s <- nspines[31:60,]   # southern sample (last 30 rows)
north_mean <- mean(nspines_n$dbh)
south_mean <- mean(nspines_s$dbh)
north_mean - south_mean        # observed difference in mean DBH (north minus south)
## [1] -10.83333
bootstrapCI2 <- function(data1, data2, nsim) {
  n1 <- length(data1)
  n2 <- length(data2)

  bootCI2 <- c()   # will hold the bootstrap differences in means

  for (i in 1:nsim) {
    # resample each group with replacement, keeping the original group sizes
    bootSamp1 <- sample(1:n1, n1, replace = TRUE)
    bootSamp2 <- sample(1:n2, n2, replace = TRUE)
    # difference in resampled group means
    thisXbar <- mean(data1[bootSamp1]) - mean(data2[bootSamp2])
    bootCI2 <- c(bootCI2, thisXbar)
  }

  return(bootCI2)
}
nspinesBootCI2 <- bootstrapCI2(nspines_n$dbh, nspines_s$dbh, nsim = 1000)
hist(nspinesBootCI2)

quantile(nspinesBootCI2, c(.025, .975))
##       2.5%      97.5% 
## -18.358000  -1.810583
se <- sd(nspinesBootCI2)   # bootstrap standard error of the difference in means
# hybrid CI: point estimate +/- t quantile * bootstrap SE
north_mean - south_mean + c(-1, 1) * qt(.975, df = 58) * se
## [1] -19.216413  -2.450254
north_mean - south_mean 
## [1] -10.83333
mean(nspinesBootCI2)
## [1] -10.74486
qqnorm(nspinesBootCI2)

The observed difference in means (-10.83333) is very close to the bootstrap mean (-10.74486), suggesting low bias. Because the bootstrap distribution is relatively normal and symmetric, we can proceed with the hybrid method.
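
As a small optional check (not part of the original output), overlaying a reference line on the normal Q-Q plot makes the normality assessment easier to judge:

qqnorm(nspinesBootCI2)
qqline(nspinesBootCI2)   # points close to this line support approximate normality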

t.test(nspines_n$dbh, nspines_s$dbh)
## 
##  Welch Two Sample t-test
## 
## data:  nspines_n$dbh and nspines_s$dbh
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.090199  -2.576468
## sample estimates:
## mean of x mean of y 
##  23.70000  34.53333