2a. This holds only if the original sample is adequately large.
2b. This is not true; bootstrap samples are drawn with replacement.
2c. This is not true; each bootstrap sample should be the same size as the original sample.
2d. The samples are taken from the original sample, not from the population.
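To make these points concrete, here is a minimal sketch (the vector x below is made up for illustration): a bootstrap sample is drawn from the original sample, with replacement, and matches its size.
x <- c(3, 7, 1, 9, 4)                         # hypothetical original sample
boot <- sample(x, length(x), replace = TRUE)  # same size, with replacement, drawn from x itself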
3a.
set.seed(1)
treatment_group <- c(57, 61)
control_group <- c(42, 62, 41, 28)
obs_diff <- mean(treatment_group) - mean(control_group)
obs_diff
## [1] 15.75
all_obs <- c(treatment_group, control_group)
perm_treatment <- sample(all_obs, 2)  # randomly reassign two observations to treatment
perm_control <- all_obs[!all_obs %in% perm_treatment]  # values are distinct, so %in% removes exactly the two sampled ones
print(c('diff is', mean(perm_treatment) - mean(perm_control)))
## [1] "diff is" "16.5"
test_statistics <- numeric(20)
for (i in 1:20) {
  perm_treatment <- sample(all_obs, 2)
  perm_control <- all_obs[!all_obs %in% perm_treatment]
  test_statistics[i] <- mean(perm_treatment) - mean(perm_control)
}
hist(test_statistics)
proportion <- mean(test_statistics >= obs_diff)
proportion
## [1] 0.3
Below is the number of treatment assignments with a statistic greater than or equal to the observed value, followed by the exact p-value.
full_sample <- c(control_group, treatment_group)
count <- 0
# enumerate each unordered pair of indices exactly once:
# choose(6, 2) = 15 possible treatment assignments
for (i in 1:5) {
  for (j in (i + 1):6) {
    treatment_group_perm <- full_sample[c(i, j)]
    control_group_perm <- full_sample[-c(i, j)]
    perm_diff <- mean(treatment_group_perm) - mean(control_group_perm)
    if (perm_diff >= obs_diff) {
      count <- count + 1
    }
  }
}
count
## [1] 3
count / 15
## [1] 0.2
My estimate of 0.3 was off from the exact p-value of 0.2 by 0.1, so the 20-permutation approximation was fairly accurate but not perfect.
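As a cross-check, the same enumeration can be written with base R's combn, which generates each unordered assignment exactly once (a sketch, not part of the original assignment):
pairs <- combn(6, 2)  # all choose(6, 2) = 15 possible treatment assignments
perm_diffs <- apply(pairs, 2, function(idx) mean(full_sample[idx]) - mean(full_sample[-idx]))
mean(perm_diffs >= obs_diff)
## [1] 0.2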
trees <- read.csv("nspines.csv")
boxplot(dbh ~ ns, data = trees)  # side-by-side boxplots of dbh for the 'n' and 's' groups
I think it is appropriate to use t procedures: the boxplots look consistent with approximately normal data, and n = 60 gives a reasonably large sample size.
test_statistic <- mean(trees$dbh[trees$ns == 'n']) - mean(trees$dbh[trees$ns == 's'])
test_statistic
## [1] -10.83333
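bootStrapCI2 is a helper from the course materials and is not shown here. A minimal sketch of the behavior it is assumed to have, namely resampling each group independently with replacement and returning the bootstrap distribution of the difference in means:
# Sketch of the assumed behavior of bootStrapCI2; not necessarily the
# course's exact implementation.
bootStrapCI2 <- function(x, y, n_boot) {
  replicate(n_boot,
            mean(sample(x, length(x), replace = TRUE)) -
              mean(sample(y, length(y), replace = TRUE)))
}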
boot_sample <- bootStrapCI2(trees$dbh[trees$ns == 'n'], trees$dbh[trees$ns == 's'], 1000)
hist(boot_sample)
# quantile method: middle 95% of the bootstrap distribution
quantile_method_result <- quantile(boot_sample, c(0.025, 0.975))
sprintf("quantile CI: 2.5%% is %.2f, 97.5%% is %.2f", quantile_method_result[1], quantile_method_result[2])
## [1] "quantile CI: 2.5% is -18.80, 97.5% is -2.90"
# hybrid method: test statistic +/- t quantile * SD of the bootstrap distribution
# t quantile uses df = min(n1, n2) - 1 = 29, assuming 30 trees per group
hybrid_method_result <- test_statistic + c(-1, 1) * qt(0.975, df = 29) * sd(boot_sample)
sprintf("hybrid CI: 2.5%% is %.2f, 97.5%% is %.2f", hybrid_method_result[1], hybrid_method_result[2])
## [1] "hybrid CI: 2.5% is -19.35, 97.5% is -2.32"
I do not think these confidence intervals would be reliable. The histogram of the bootstrap distribution does not look normal; it appears skewed to the right.
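A normal quantile plot (an extra check, not required by the assignment) makes the departure from normality easier to judge than a histogram:
qqnorm(boot_sample)  # systematic curvature away from the line would confirm the skew
qqline(boot_sample)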
t.test(trees$dbh[trees$ns == 'n'],trees$dbh[trees$ns == 's'])
##
## Welch Two Sample t-test
##
## data: trees$dbh[trees$ns == "n"] and trees$dbh[trees$ns == "s"]
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333
The bootstrap confidence intervals are very similar to the one produced by the traditional t test: each endpoint of the t interval falls between the corresponding quantile-method and hybrid-method endpoints. I would opt for the traditional t test, since the histogram of the bootstrap distribution was not normal, which undermines the bootstrap intervals (particularly the hybrid method, which relies on that normality).