FALSE: The standard deviation of the original sample will be much larger than that of the bootstrap distribution. This is because the SD of the sample describes single observations, whereas the bootstrap distribution consists of sample means. Single observations vary much more than sample means.
FALSE: The bootstrap distribution is created by resampling WITH replacement.
FALSE: The resamples should be the same sample size as the original sample for it to be a representative distribution.
FALSE: The bootstrap distribution is created by resampling with replacement from the original sample, not the population.
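The four answers above can be illustrated with a short sketch. The data vector `x`, the seed, and the number of resamples `B` are hypothetical values chosen only for illustration; the point is that each resample is drawn with replacement from the original sample, has the same size as the original sample, and that the resulting means vary much less than the single observations.

```r
# Hypothetical sample (not from the assignment data)
x <- c(12, 15, 9, 22, 30, 7, 18, 25, 11, 16)

set.seed(1)  # arbitrary seed, for reproducibility only
B <- 2000    # number of bootstrap resamples (an assumption)
boot_means <- numeric(B)

for (b in seq_len(B)) {
  # resample WITH replacement, same size as the original sample,
  # drawing from the sample itself rather than the population
  resample <- sample(x, size = length(x), replace = TRUE)
  boot_means[b] <- mean(resample)
}

sd(x)           # spread of single observations
sd(boot_means)  # spread of resampled means, much smaller
```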
Treatment <- c(57, 61)
Control <- c(42, 62, 41, 28)
Dif_Means <- mean(Treatment) - mean(Control)
Dif_Means
## [1] 15.75
set.seed(123)
Full_samp <- c(Treatment, Control)
sample(Full_samp, 2)
## [1] 42 28
mean_treatment_sample <- mean(c(42,28))
mean_control_sample <- mean(c(57,61,62,41))
mean_treatment_sample - mean_control_sample
## [1] -20.25
For this permutation resample, the difference between group means is -20.25.
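set.seed(111)
nsim <- 20
n1 <- length(Treatment)
n2 <- length(Control)
permNull <- c()
for (i in 1:nsim) {
  # randomly choose which n1 observations are relabeled as Treatment
  permSamp <- sample(1:(n1 + n2), n1, replace = FALSE)
  # difference in means: relabeled Treatment minus relabeled Control
  thisXbar <- mean(Full_samp[permSamp]) - mean(Full_samp[-permSamp])
  permNull <- c(permNull, thisXbar)
}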
hist(permNull)
permNull
## [1] -20.25 5.25 1.50 -10.50 19.50 0.75 3.75 19.50 1.50 1.50
## [11] 0.75 -9.00 5.25 -5.25 -9.00 4.50 4.50 -21.00 -5.25 16.50
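Three of the 20 permutation differences (19.50, 19.50, and 16.50) are greater than or equal to the observed difference of 15.75, so the estimated one-sided p-value is 3/20 = .15.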
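The count behind this estimate can be reproduced from the permutation differences printed above. This is a small sketch that copies those 20 values by hand; `observed`, `n_extreme`, and `p_est` are names introduced here for illustration.

```r
# The 20 permutation differences printed above
permNull <- c(-20.25, 5.25, 1.50, -10.50, 19.50, 0.75, 3.75, 19.50, 1.50, 1.50,
              0.75, -9.00, 5.25, -5.25, -9.00, 4.50, 4.50, -21.00, -5.25, 16.50)
observed <- 15.75  # observed difference in means (Dif_Means)

n_extreme <- sum(permNull >= observed)  # resamples at least as large as observed
p_est <- mean(permNull >= observed)     # one-sided estimated p-value

n_extreme  # 3
p_est      # 0.15
```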
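Of the 15 possible ways to assign two of the six observations to the treatment group, three give a difference greater than or equal to the observed value. The exact p-value is therefore 3/15 = .2, which is reasonably close to the estimate.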
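Because there are only choose(6, 2) = 15 possible relabelings, the exact permutation distribution can be enumerated rather than simulated. The following is a sketch using base R's `combn()`; the names `idx`, `exact_diffs`, `n_extreme`, and `p_exact` are introduced here for illustration.

```r
Treatment <- c(57, 61)
Control <- c(42, 62, 41, 28)
Full_samp <- c(Treatment, Control)
observed <- mean(Treatment) - mean(Control)  # 15.75

# Each column of idx is one way to pick 2 of the 6 positions as "Treatment"
idx <- combn(length(Full_samp), length(Treatment))

# Difference in means for every possible relabeling
exact_diffs <- apply(idx, 2, function(k) {
  mean(Full_samp[k]) - mean(Full_samp[-k])
})

n_extreme <- sum(exact_diffs >= observed)
p_exact <- mean(exact_diffs >= observed)

ncol(idx)  # 15
n_extreme  # 3
p_exact    # 0.2
```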
setwd("H:/MATH239")
calls <- read.csv("calls80.csv", header = TRUE)
hist(calls$length)
The distribution is heavily skewed to the right.
bootstrapCI1 <- function(data, nsim){
  n <- length(data)
  bootCI <- c() # will hold the bootstrap means
  for(i in 1:nsim){
    # draw n indices with replacement, then average the resampled values
    bootSamp <- sample(1:n, n, replace=TRUE)
    thisXbar <- mean(data[bootSamp])
    bootCI <- c(bootCI, thisXbar)
  }
  return(bootCI)
}
callsBootCI <- bootstrapCI1(calls$length, nsim=1000)
hist(callsBootCI)
qqnorm(callsBootCI)
The bootstrapped sampling distribution is close to normal, but the tails depart from normality.
The normal Q-Q plot indicates that the bootstrapped sampling distribution has a right skew.
SRS_calls <- c(104,102,35,211,56,325,67,9,179,59)
SRS_callsBootCI <- bootstrapCI1(SRS_calls, nsim=1000)
hist(SRS_callsBootCI)
qqnorm(SRS_callsBootCI)
The bootstrap distribution for the SRS of 10 is similarly close to normal and shows the same right skew.
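The right skew read off the Q-Q plot can also be checked numerically. The sketch below uses a simple moment-based skewness function; `skewness()` is a helper defined here and not a base R function, and the seed and the 5000 resamples are arbitrary choices, so the value will differ slightly from run to run.

```r
# SRS of 10 call lengths, copied from the text above
SRS_calls <- c(104, 102, 35, 211, 56, 325, 67, 9, 179, 59)

set.seed(2024)  # arbitrary seed (an assumption)
boot_means <- replicate(5000,
                        mean(sample(SRS_calls, length(SRS_calls), replace = TRUE)))

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(boot_means)  # positive for a right-skewed bootstrap distribution
```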
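std.er_callsBoot <- sd(callsBootCI) # SD of the bootstrap means is the bootstrap SE
std.er_callsBoot
## [1] 37.21
std.er_SRScallsBoot <- sd(SRS_callsBootCI)
std.er_SRScallsBoot
## [1] 28.02
The standard deviation of the bootstrap distribution of the mean is already the bootstrap standard error, so it should not be divided by the square root of the sample size a second time. Dividing sd(callsBootCI) by sqrt(80) would give 4.16038 and dividing sd(SRS_callsBootCI) by sqrt(10) would give 8.85995, both of which understate the standard error.
In general, the standard error of a mean is roughly s/sqrt(n). For a fixed spread s, a smaller sample size makes the denominator smaller and the standard error larger, because each resampled mean averages fewer individual values. Here, however, the bootstrap standard error is smaller for the SRS of 10 (about 28.0) than for the sample of 80 (about 37.2), because the two samples differ greatly in spread. The SRS has s of about 96.8, while a standard error of 37.2 with n = 80 implies s of roughly 333. The SRS evidently missed the very long calls in the right tail, and its smaller spread outweighs its smaller sample size.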
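The standard deviation of the bootstrap means is itself an estimate of the standard error of the mean, so it can be compared directly with the formula s/sqrt(n). The sketch below does this for the SRS of 10; the seed and the 5000 resamples are arbitrary choices, and `boot_se` and `formula_se` are names introduced here.

```r
# SRS of 10 call lengths, copied from the text above
SRS_calls <- c(104, 102, 35, 211, 56, 325, 67, 9, 179, 59)
n <- length(SRS_calls)

set.seed(2024)  # arbitrary seed (an assumption)
boot_means <- replicate(5000, mean(sample(SRS_calls, n, replace = TRUE)))

boot_se <- sd(boot_means)              # bootstrap SE: SD of the resampled means
formula_se <- sd(SRS_calls) / sqrt(n)  # usual formula s / sqrt(n)

c(bootstrap = boot_se, formula = formula_se)
```

The two values should be close. The bootstrap value tends to be slightly smaller because resampling treats the sample as the population, which corresponds to dividing by n rather than n - 1 when computing the spread.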
setwd("H:/MATH239")
nspines <- read.csv("nspines.csv", header = TRUE)
boxplot(nspines$dbh ~ nspines$ns, xlab = "Region", ylab = "DBH")
It appears that the northern region is heavily skewed to the right, whereas the southern region is heavily skewed to the left. Each group has a sample size of 30, which is only a common rule-of-thumb threshold for relying on the central limit theorem, not a guarantee. Given the borderline sample size and the heavy skew of both groups, bootstrapping is the safer choice.
nspines_north <- nspines[1:30,]
nspines_south <- nspines[31:60,]
north_mean <- mean(nspines_north$dbh)
south_mean <- mean(nspines_south$dbh)
north_mean - south_mean
## [1] -10.83333
bootstrapCI2 <- function(data1, data2, nsim){
  n1 <- length(data1)
  n2 <- length(data2)
  bootCI2 <- c()
  for(i in 1:nsim){
    bootSamp1 <- sample(1:n1, n1, replace=TRUE)
    bootSamp2 <- sample(1:n2, n2, replace=TRUE)
    thisXbar <- mean(data1[bootSamp1])-mean(data2[bootSamp2])
    bootCI2 <- c(bootCI2, thisXbar)
  }
  return(bootCI2)
}
nspinesBootCI2 <- bootstrapCI2(nspines_north$dbh, nspines_south$dbh, nsim=10000)
hist(nspinesBootCI2)
quantile(nspinesBootCI2, c(.025, .975))
## 2.5% 97.5%
## -18.736833 -2.916667
se <- sd(nspinesBootCI2) # bootstrap standard error of the difference in means
north_mean - south_mean + c(-1, 1) * qt(.975, df = 58) * se
## [1] -19.038557 -2.628109
north_mean - south_mean # observed difference in sample means
## [1] -10.83333
mean(nspinesBootCI2) # mean of the bootstrap distribution
## [1] -10.86894
qqnorm(nspinesBootCI2)
Since the observed difference is close to the bootstrap mean (an estimated bias of about -0.04) and the bootstrap distribution is approximately normal, the bootstrap t interval (the hybrid method) should be reliable.
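As a rough check on the low-bias claim, the bootstrap bias estimate can be computed from the values printed above and compared with the standard error. This sketch copies the printed numbers by hand and backs the standard error out of the printed t-based interval, so `se_implied` is an approximation rather than the exact `se` from the run.

```r
observed <- -10.83333   # observed difference in sample means (printed above)
boot_mean <- -10.86894  # mean of the bootstrap distribution (printed above)

bias <- boot_mean - observed  # bootstrap estimate of bias

# Recover the bootstrap SE from the printed t-based interval (df = 58)
t_ci <- c(-19.038557, -2.628109)
se_implied <- diff(t_ci) / (2 * qt(0.975, df = 58))

bias               # about -0.036
bias / se_implied  # bias as a fraction of the standard error
```

A bias that is a small fraction of the standard error supports using the bootstrap t interval.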
t.test(nspines_north$dbh, nspines_south$dbh)
##
## Welch Two Sample t-test
##
## data: nspines_north$dbh and nspines_south$dbh
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333
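Both bootstrap intervals are slightly narrower than the Welch interval (widths of about 15.8 for the percentile interval and 16.4 for the bootstrap t interval, versus 16.5), and the bootstrap does not rely on the normality assumption, which is doubtful given the skew in both groups. Therefore, I would use the bootstrap interval.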
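The widths of the three intervals reported above can be compared directly. This sketch copies the printed endpoints by hand; the vector names are introduced here for illustration.

```r
percentile_ci <- c(-18.736833, -2.916667)  # bootstrap percentile interval
boot_t_ci <- c(-19.038557, -2.628109)      # bootstrap t interval
welch_ci <- c(-19.090199, -2.576468)       # Welch t-test interval

widths <- c(percentile = diff(percentile_ci),
            boot_t = diff(boot_t_ci),
            welch = diff(welch_ci))

round(widths, 2)  # 15.82, 16.41, 16.51
```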