Problem 2

(a) The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample.

FALSE: The standard deviation of the original sample will be much larger than that of the bootstrap distribution. The SD of the sample measures the variation among single observations, whereas the bootstrap distribution consists of sample means, and sample means vary far less than single observations (by roughly a factor of √n).
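
A quick sketch makes this concrete, using simulated data (the exponential sample below is purely hypothetical):

set.seed(1)
x <- rexp(50, rate = 1/100)   # a skewed sample of n = 50 (hypothetical data)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(x)                  # SD of single observations
sd(boot_means)         # SD of bootstrap means: close to sd(x)/sqrt(50)
sd(x)/sqrt(50)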

(b) The bootstrap distribution is created by resampling without replacement from the original sample.

FALSE: The bootstrap distribution is created by resampling WITH replacement.

(c) When generating the resamples, it is best to use a sample size smaller than the size of the original sample.

FALSE: The resamples should be the same size as the original sample; otherwise the bootstrap distribution reflects the variability of the statistic at the wrong sample size, as the sketch below illustrates.
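
A brief sketch of why the resample size matters (hypothetical simulated data): drawing fewer observations per resample inflates the spread of the bootstrap distribution.

set.seed(2)
x <- rnorm(100)
sd(replicate(2000, mean(sample(x, 100, replace = TRUE))))  # approximates the SE at n = 100
sd(replicate(2000, mean(sample(x, 25, replace = TRUE))))   # inflated: behaves like the SE at n = 25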

(d) The bootstrap distribution is created by resampling with replacement from the population.

FALSE: The bootstrap distribution is created by resampling with replacement from the original sample, not the population.

Problem 3

(a) Calculate the difference in means (treatment - control) between the two groups. This is the observed value of the statistic.

Treatment <- c(57, 61)
Control <- c(42, 62, 41, 28)
Dif_Means <- mean(Treatment) - mean(Control)  # observed statistic: treatment - control
Dif_Means
## [1] 15.75

(b) What is the difference between the two groups?

set.seed(123)
Full_samp <- c(Treatment, Control)
sample(Full_samp, 2)  # randomly pick 2 of the 6 observations as the permuted treatment group
## [1] 42 28
mean_treatment_sample <- mean(c(42, 28))        # permuted treatment mean
mean_control_sample <- mean(c(57, 61, 62, 41))  # remaining 4 observations as control
mean_treatment_sample - mean_control_sample
## [1] -20.25

The difference between group means for this permutation is -20.25.

(c) Repeat 20 times. Make a histogram of the distribution.

set.seed(111)
nsim <- 20
n1 <- length(Treatment)
n2 <- length(Control)
permNull <- numeric(nsim)  # preallocate storage for the permuted statistics

for(i in 1:nsim){
  permSamp <- sample(1:(n1+n2), n1, replace=FALSE)  # indices for the permuted treatment group
  permNull[i] <- mean(Full_samp[permSamp]) - mean(Full_samp[-permSamp])
}

hist(permNull)

(d) What proportion of the 20 statistic values were equal to or greater than the original value in part (a)?

permNull
##  [1] -20.25   5.25   1.50 -10.50  19.50   0.75   3.75  19.50   1.50   1.50
## [11]   0.75  -9.00   5.25  -5.25  -9.00   4.50   4.50 -21.00  -5.25  16.50

3/20, so the estimated p-value = 0.15.
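
The proportion can also be computed directly from the saved vector:

mean(permNull >= Dif_Means)  # proportion of permuted statistics >= observed; 0.15 here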

(e) Calculate the exact p-value.

There are choose(6, 2) = 15 ways to assign two of the six observations to the treatment group. Three of them ({57, 61}, {57, 62}, {61, 62}) give a statistic greater than or equal to our observed value of 15.75, so the exact p-value = 3/15 = 0.2 (pretty close to the estimate of 0.15).
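
As a check, all 15 assignments can be enumerated; a minimal sketch using combn() on the indices of Full_samp:

idx <- combn(1:6, 2)  # every way to choose 2 of the 6 indices as the treatment group
exact_diffs <- apply(idx, 2, function(i) mean(Full_samp[i]) - mean(Full_samp[-i]))
mean(exact_diffs >= Dif_Means)  # exact p-value: 3/15 = 0.2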

Problem 4

Part I

(a) Make a histogram of the call lengths. Describe the shape of the distribution.

setwd("H:/MATH239")
calls <- read.csv("calls80.csv", header = TRUE)
hist(calls$length)

The distribution is heavily skewed to the right.

(b) Bootstrap these data using 1000 resamples.

bootstrapCI1 <- function(data, nsim){
  n <- length(data)
  bootCI <- numeric(nsim)  # preallocate storage for the bootstrap means
  for(i in 1:nsim){
    bootSamp <- sample(1:n, n, replace=TRUE)  # resample indices WITH replacement
    bootCI[i] <- mean(data[bootSamp])         # mean of this resample
  }
  return(bootCI)
}

callsBootCI <- bootstrapCI1(calls$length, nsim=1000)

hist(callsBootCI)

qqnorm(callsBootCI)

The bootstrapped sampling distribution is close to normal, yet the tails depart from normality.

(c) In what ways do the tails depart from normality?

The Q-Q plot indicates that the bootstrapped sampling distribution has a right skew.
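
A reference line makes the tail departures easier to judge; qqline() from base R draws a line through the quartiles:

qqnorm(callsBootCI)
qqline(callsBootCI)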

Part II

(d) Create bootstrapped samples for an SRS of n = 10 from the data. Is this distribution closer to or farther away from normal?

SRS_calls <- c(104,102,35,211,56,325,67,9,179,59)

# reuse bootstrapCI1() as defined in part (b)
SRS_callsBootCI <- bootstrapCI1(SRS_calls, nsim=1000)

hist(SRS_callsBootCI)

qqnorm(SRS_callsBootCI)

The bootstrap distribution for the SRS of 10 is similarly close to normal, with the same right skew.

(e) Compare the bootstrap standard errors for your two sets of resamples. Why is it larger for the smaller SRS?

std.er_callsBoot <- sd(callsBootCI)  # bootstrap SE = SD of the bootstrap distribution
std.er_callsBoot
std.er_SRScallsBoot <- sd(SRS_callsBootCI)
std.er_SRScallsBoot

It is larger for the smaller SRS because of the smaller sample size. The bootstrap standard error is simply the standard deviation of the bootstrap distribution (each bootstrap statistic is already a mean of n observations, so no further division by √n is needed). Since each resampled mean is based on fewer individual values, the means vary more from resample to resample. This mirrors the formula SE = s/√n: a smaller sample size makes the denominator smaller, which makes the standard error larger.

Problem 5

(a) Examine the data graphically with a boxplot. Does it seem reasonable to use standard t procedures?

setwd("H:/MATH239")
nspines <- read.csv("nspines.csv", header = TRUE)
boxplot(nspines$dbh ~ nspines$ns, xlab = "Region", ylab = "DBH")

It appears that the northern region is heavily skewed to the right, whereas the southern region is heavily skewed to the left. Each group has a sample size of 30, which is only the bare minimum for invoking the central limit theorem. Given samples that are barely large enough and heavy skew in both groups, it would be much better to use bootstrapping.
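
A quick numerical check of the skew compares the mean and median within each region (this reuses the dbh and ns columns from the boxplot call above):

tapply(nspines$dbh, nspines$ns, function(x) c(mean = mean(x), median = median(x)))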

(b) Calculate our observed statistic.

nspines_north <- nspines[1:30,]   # rows 1-30: northern trees
nspines_south <- nspines[31:60,]  # rows 31-60: southern trees
north_mean <- mean(nspines_north$dbh)
south_mean <- mean(nspines_south$dbh)
north_mean - south_mean
## [1] -10.83333

(c) Bootstrap the difference in means and look at the bootstrap distribution.

bootstrapCI2 <- function(data1, data2, nsim){
  n1 <- length(data1)
  n2 <- length(data2)
  bootCI2 <- numeric(nsim)  # preallocate storage for the bootstrap differences

  for(i in 1:nsim){
    bootSamp1 <- sample(1:n1, n1, replace=TRUE)  # resample each group independently,
    bootSamp2 <- sample(1:n2, n2, replace=TRUE)  # with replacement
    bootCI2[i] <- mean(data1[bootSamp1]) - mean(data2[bootSamp2])
  }

  return(bootCI2)
}

nspinesBootCI2 <- bootstrapCI2(nspines_north$dbh, nspines_south$dbh, nsim=10000)
hist(nspinesBootCI2)

(d) Calculate the quantile and hybrid confidence intervals.

quantile(nspinesBootCI2, c(.025, .975))
##       2.5%      97.5% 
## -18.736833  -2.916667
se <- sd(nspinesBootCI2)  # bootstrap SE of the difference
(north_mean - south_mean) + c(-1,1)*qt(.975, df=58)*se
## [1] -19.038557  -2.628109

(e) Comment on whether the conditions for the hybrid method are met. Do you believe this interval would be reliable?

north_mean - south_mean  # observed statistic
## [1] -10.83333
mean(nspinesBootCI2)     # mean of the bootstrap distribution
## [1] -10.86894
qqnorm(nspinesBootCI2)

Since the observed statistic is close to the bootstrap mean (which indicates low bias) and the bootstrap distribution is approximately normal, the hybrid method should be reliable.
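
The bias can be made explicit by subtracting the observed statistic from the bootstrap mean; using the values printed above it is roughly -0.036, which is negligible relative to the SE.

mean(nspinesBootCI2) - (north_mean - south_mean)  # estimated bias, about -0.036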

(f) Compare the bootstrap results with the usual two-sample t confidence interval. How do the intervals differ? Which would you use?

t.test(nspines_north$dbh, nspines_south$dbh)
## 
##  Welch Two Sample t-test
## 
## data:  nspines_north$dbh and nspines_south$dbh
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.090199  -2.576468
## sample estimates:
## mean of x mean of y 
##  23.70000  34.53333

The bootstrap quantile interval is slightly narrower than the t interval, which indicates a more precise estimate. Therefore, given the heavy skew in the data, I would use the bootstrap interval.
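
For reference, the interval widths can be computed from the endpoints printed above:

diff(c(-18.736833, -2.916667))  # quantile interval: about 15.82
diff(c(-19.038557, -2.628109))  # hybrid interval: about 16.41
diff(c(-19.090199, -2.576468))  # Welch t interval: about 16.51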