2a. This holds only if the original sample is adequately large.
2b. This is not true; bootstrap samples are drawn with replacement.
2c. This is not true; each bootstrap sample should be the same size as the original sample.
2d. The samples are taken from the original sample, not from the population.
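To make these points concrete, here is a minimal sketch (the vector x below is made up for illustration): a bootstrap sample is drawn from the original sample, with replacement, and matches its size.
x <- c(3, 7, 1, 9, 4)                         # hypothetical original sample
boot <- sample(x, length(x), replace = TRUE)  # same size, with replacement, drawn from x itself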
3a.
set.seed(1)
treatment_group <- c(57, 61)
control_group <- c(42, 62, 41, 28)
obs_diff <- mean(treatment_group) - mean(control_group)
obs_diff
## [1] 15.75
all_obs <- c(treatment_group, control_group)
perm_treatment <- sample(all_obs, 2)  # randomly reassign two observations to treatment
perm_control <- all_obs[!all_obs %in% perm_treatment]  # values are distinct, so %in% removes exactly the two sampled ones
print(c('diff is', mean(perm_treatment) - mean(perm_control)))
## [1] "diff is" "16.5"
test_statistics <- numeric(20)
for (i in 1:20) {
  perm_treatment <- sample(all_obs, 2)
  perm_control <- all_obs[!all_obs %in% perm_treatment]
  test_statistics[i] <- mean(perm_treatment) - mean(perm_control)
}
hist(test_statistics)
proportion <- mean(test_statistics >= obs_diff)
proportion
## [1] 0.3
Below is the number of treatment assignments with a statistic greater than or equal to the observed value, followed by the exact p-value.
full_sample <- c(control_group, treatment_group)
count <- 0
# enumerate each unordered pair of indices exactly once:
# choose(6, 2) = 15 possible treatment assignments
for (i in 1:5) {
  for (j in (i + 1):6) {
    treatment_group_perm <- full_sample[c(i, j)]
    control_group_perm <- full_sample[-c(i, j)]
    perm_diff <- mean(treatment_group_perm) - mean(control_group_perm)
    if (perm_diff >= obs_diff) {
      count <- count + 1
    }
  }
}
count
## [1] 3
count / 15
## [1] 0.2
My estimate of 0.3 was off from the exact p-value of 0.2 by 0.1, so the 20-permutation approximation was fairly accurate but not perfect.
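As a cross-check, the same enumeration can be written with base R's combn, which generates each unordered assignment exactly once (a sketch, not part of the original assignment):
pairs <- combn(6, 2)  # all choose(6, 2) = 15 possible treatment assignments
perm_diffs <- apply(pairs, 2, function(idx) mean(full_sample[idx]) - mean(full_sample[-idx]))
mean(perm_diffs >= obs_diff)
## [1] 0.2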
trees <- read.csv("nspines.csv")
boxplot(dbh ~ ns, data = trees)  # side-by-side boxplots of dbh for the 'n' and 's' groups
I think it is appropriate to use t procedures: the boxplots look consistent with approximately normal data, and n = 60 gives a reasonably large sample size.
test_statistic <- mean(trees$dbh[trees$ns == 'n']) - mean(trees$dbh[trees$ns == 's'])
test_statistic
## [1] -10.83333
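bootStrapCI2 is a helper from the course materials and is not shown here. A minimal sketch of the behavior it is assumed to have, namely resampling each group independently with replacement and returning the bootstrap distribution of the difference in means:
# Sketch of the assumed behavior of bootStrapCI2; not necessarily the
# course's exact implementation.
bootStrapCI2 <- function(x, y, n_boot) {
  replicate(n_boot,
            mean(sample(x, length(x), replace = TRUE)) -
              mean(sample(y, length(y), replace = TRUE)))
}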
boot_sample <- bootStrapCI2(trees$dbh[trees$ns == 'n'], trees$dbh[trees$ns == 's'], 1000)
hist(boot_sample)
# quantile method: middle 95% of the bootstrap distribution
quantile_method_result <- quantile(boot_sample, c(0.025, 0.975))
sprintf("quantile CI: 2.5%% is %.2f, 97.5%% is %.2f", quantile_method_result[1], quantile_method_result[2])
## [1] "quantile CI: 2.5% is -18.80, 97.5% is -2.90"
# hybrid method: test statistic +/- t quantile * SD of the bootstrap distribution
# t quantile uses df = min(n1, n2) - 1 = 29, assuming 30 trees per group
hybrid_method_result <- test_statistic + c(-1, 1) * qt(0.975, df = 29) * sd(boot_sample)
sprintf("hybrid CI: 2.5%% is %.2f, 97.5%% is %.2f", hybrid_method_result[1], hybrid_method_result[2])
## [1] "hybrid CI: 2.5% is -19.35, 97.5% is -2.32"
I do not think these confidence intervals would be reliable. The histogram of the bootstrap distribution does not look normal; it appears skewed to the right.
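A normal quantile plot (an extra check, not required by the assignment) makes the departure from normality easier to judge than a histogram:
qqnorm(boot_sample)  # systematic curvature away from the line would confirm the skew
qqline(boot_sample)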
t.test(trees$dbh[trees$ns == 'n'],trees$dbh[trees$ns == 's'])
##
## Welch Two Sample t-test
##
## data: trees$dbh[trees$ns == "n"] and trees$dbh[trees$ns == "s"]
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333
The bootstrap confidence intervals are very similar to the one produced by the traditional t test: each endpoint of the t interval falls between the corresponding quantile-method and hybrid-method endpoints. I would opt for the traditional t test, since the histogram of the bootstrap distribution was not normal, which undermines the bootstrap intervals (particularly the hybrid method, which relies on that normality).