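The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample. False. The standard deviation of the bootstrap distribution estimates the standard error of the statistic (for a sample mean, roughly s/sqrt(n)), which is smaller than the standard deviation of the original sample. The bootstrap distribution is built by computer simulation, resampling from the original sample with replacement many times.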
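The bootstrap distribution is created by resampling without replacement from the original sample. False. Resampling is done with replacement; resampling without replacement at the original sample size would reproduce the same sample every time.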
When generating the resamples, it is best to use a sample size smaller than the size of the original sample. False. Each resample must contain the same number of observations as the original sample. (Professor Knudson, 2017; https://www.youtube.com/watch?v=CeLlJ5EngYY)
The bootstrap distribution is created by resampling with replacement from the population. False. The bootstrap distribution is created by resampling with replacement from the original sample. In practice we rarely have access to the whole population, which is why the sample stands in for it.
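The four statements above can be checked with a small simulation. The sketch below uses a made-up sample (`x`, drawn from a normal distribution purely for illustration, not data from this assignment) and resamples it with replacement at the original sample size. Under those assumptions, the standard deviation of the bootstrap means lands near s/sqrt(n) rather than near s.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
x <- rnorm(50, mean = 30, sd = 10)   # hypothetical sample, for illustration only
n <- length(x)

# Each resample draws n values from the sample itself, with replacement
boot_means <- replicate(5000, mean(sample(x, size = n, replace = TRUE)))

sd(x)             # spread of the original sample
sd(x) / sqrt(n)   # formula-based standard error of the mean
sd(boot_means)    # bootstrap standard error; close to the line above
```

Changing `replace = TRUE` to `replace = FALSE` makes every resample a reordering of `x`, so all the resampled means are identical and the bootstrap distribution collapses to a single value.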
Do new “directed reading activities” improve the reading ability of elementary school students, as measured by their “Degree of Reading Power” (DRP) scores? A study assigns students at random to either the new method (treatment group, 21 students) or traditional teaching methods (control group, 23 students).
To illustrate the process, let’s perform a permutation test by hand for a small random subset of the DRP data. Here is the subset: treatment group 57, 61; control group 42, 62, 41, 28.
TreatG <- c(57, 61)
ContG <- c(42, 62, 41, 28)
mean(TreatG)-mean(ContG)
## [1] 15.75
n1 <- length(TreatG)
n2 <- length(ContG)
pooled <- c(TreatG, ContG) # pool all six scores into one vector
idx <- sample(length(pooled), size=n1, replace=FALSE) # positions relabeled as treatment
mean(pooled[idx])-mean(pooled[-idx]) # the remaining four scores form the permuted control group
# Repeat the random relabeling 20 times; the values change from run to run
PermD <- replicate(20, {
  idx <- sample(length(pooled), size=n1, replace=FALSE)
  mean(pooled[idx])-mean(pooled[-idx])
})
hist(PermD)
The histogram shows how large a difference in means arises from random relabeling alone, which is the reference for judging the value calculated in (a).
mean(PermD >= mean(TreatG)-mean(ContG))
The p-value is the proportion of permuted differences at least as large as the observed difference of 15.75. With only six observations there are just 15 distinct ways to choose the two treatment scores, so the permutation distribution is coarse and the smallest possible one-sided p-value is 1/15.
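Because the subset is so small, the random relabeling can be cross-checked by listing every possible split. The sketch below enumerates all 15 ways to choose two of the six scores as the treatment group and computes the exact one-sided p-value. The names `pooled`, `splits`, and `perm_diffs` are introduced here for illustration.

```r
TreatG <- c(57, 61)
ContG  <- c(42, 62, 41, 28)
pooled <- c(TreatG, ContG)
obs <- mean(TreatG) - mean(ContG)   # observed difference, 15.75

# Every way to pick 2 of the 6 positions as the treatment group
splits <- combn(length(pooled), length(TreatG))   # 2 x 15 matrix of positions

# Difference in means for each relabeling; the other 4 scores are the control group
perm_diffs <- apply(splits, 2, function(idx) mean(pooled[idx]) - mean(pooled[-idx]))

sort(perm_diffs)          # ranges from -21 to 19.5
mean(perm_diffs >= obs)   # exact one-sided p-value: 3/15 = 0.2
```

Only the treatment pairs (57, 61), (57, 62), and (61, 62) give a difference at least as large as 15.75, so a p-value from 20 random relabelings should fall near 0.2.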
We want to compare the mean diameter at breast height (DBH) for trees from the northern and southern halves of a land tract using a random sample of 30 trees from each region. These data are available on the WISE site, named nspines.csv.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(readxl)
library(stringr)
nspines <- read_excel("nspines.xlsx")
str(nspines)
## Classes 'tbl_df', 'tbl' and 'data.frame': 60 obs. of 2 variables:
## $ ns : chr "n" "n" "n" "n" ...
## $ dbh: num 27.8 14.5 39.1 3.2 58.8 55.5 25 5.4 19 30.6 ...
summary(nspines)
## ns dbh
## Length:60 Min. : 2.20
## Class :character 1st Qu.:14.10
## Mode :character Median :29.85
## Mean :29.12
## 3rd Qu.:43.90
## Max. :58.80
ggplot(nspines, aes(x = ns, y = dbh, color = ns))+
geom_point()+
facet_wrap(~ ns)
xnorth <- nspines %>%
filter(ns == "n")
Xnorth <- xnorth$dbh
xsouth <- nspines %>%
filter(ns == "s")
Xsouth <- xsouth$dbh
mean(xnorth$dbh)-mean(xsouth$dbh)
## [1] -10.83333
nspinesBS<-function(Xnorth, Xsouth, nsim){
  # Use the vectors passed in; length() of the data frames xnorth/xsouth counts columns, not trees
  n1<-length(Xnorth)
  n2<-length(Xsouth)
  BS<-numeric(nsim)
  for(i in seq_len(nsim)){
    # Resample each region separately, with replacement, at its original size
    bootSamp1<-sample(1:n1, n1, replace=TRUE)
    bootSamp2<-sample(1:n2, n2, replace=TRUE)
    BS[i]<-mean(Xnorth[bootSamp1])-mean(Xsouth[bootSamp2])
  }
  return(BS)
}
nspinesBootCI<-nspinesBS(Xnorth, Xsouth, nsim=10000)
# Center of the bootstrap distribution; compare with the observed difference of -10.83
mean(nspinesBootCI)
hist(nspinesBootCI)
# Quantile Method
# Middle 95% of the bootstrap differences, to match the 95% level of t.test()
quantile(nspinesBootCI, c(0.025, 0.975))
# Hybrid Method
# The SD of the bootstrap distribution is already the standard error; no division by sqrt(n)
bootSE<-sd(nspinesBootCI)
bootSE
# One critical value only; df = smaller group size minus 1 is a conservative choice
tcrit<-qt(0.975, df=min(length(Xnorth), length(Xsouth))-1)
(mean(Xnorth)-mean(Xsouth))+c(-1,1)*tcrit*bootSE
The conditions for the hybrid method might not be met. The hybrid interval assumes that the bootstrap distribution is approximately normal and shows little bias, but the bootstrap distribution is built from the sample and inherits any skewness in it, so the histogram should be checked before relying on this interval.
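The two conditions mentioned above, small bias and rough normality, can be checked numerically. The sketch below uses hypothetical right-skewed data (`north` and `south` are simulated from exponential distributions, not taken from the nspines file) and reports the bootstrap bias, the bootstrap standard error, and a simple moment-based skewness.

```r
set.seed(2)                        # arbitrary seed, only for reproducibility
north <- rexp(30, rate = 1 / 24)   # hypothetical skewed DBH values, for illustration only
south <- rexp(30, rate = 1 / 35)
obs <- mean(north) - mean(south)

# Bootstrap distribution of the difference in means (each group resampled separately)
boot <- replicate(5000,
  mean(sample(north, replace = TRUE)) - mean(sample(south, replace = TRUE)))

bias <- mean(boot) - obs                            # should be small relative to the SE
skew <- mean((boot - mean(boot))^3) / sd(boot)^3    # near 0 for a symmetric distribution
c(bias = bias, se = sd(boot), skewness = skew)
```

A bias that is a small fraction of the standard error and a skewness near zero support the hybrid interval. For a visual check, `qqnorm(boot); qqline(boot)` should show points close to the line.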
t.test(Xnorth, Xsouth)
##
## Welch Two Sample t-test
##
## data: Xnorth and Xsouth
## t = -2.6286, df = 55.725, p-value = 0.01106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.090199 -2.576468
## sample estimates:
## mean of x mean of y
## 23.70000 34.53333
23.70000-34.53333
## [1] -10.83333
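The observed difference in means, -10.83, is the quantity the bootstrap distribution is centered on, so the two methods should agree on the point estimate. The Welch interval (-19.09, -2.58) excludes zero, consistent with its p-value of 0.011, so the data indicate a smaller mean DBH in the northern half. A narrower interval is more precise, but width alone does not show that one method is more reliable than the other. With 30 trees per region, the bootstrap and t intervals are expected to be similar when the bootstrap distribution is roughly symmetric.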
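To see how closely the two approaches usually agree, the sketch below computes a Welch 95% interval and a bootstrap quantile 95% interval on the same simulated data. The vectors `north` and `south` are hypothetical normal samples of 30 values each, chosen only to resemble the setup here, not the actual nspines measurements.

```r
set.seed(3)                               # arbitrary seed, only for reproducibility
north <- rnorm(30, mean = 24, sd = 16)    # hypothetical data, for illustration only
south <- rnorm(30, mean = 35, sd = 16)

# Welch two-sample t interval
welch_ci <- as.numeric(t.test(north, south)$conf.int)

# Bootstrap quantile interval for the same difference in means
boot <- replicate(10000,
  mean(sample(north, replace = TRUE)) - mean(sample(south, replace = TRUE)))
boot_ci <- as.numeric(quantile(boot, c(0.025, 0.975)))

rbind(welch = welch_ci, bootstrap = boot_ci)
diff(boot_ci) / diff(welch_ci)   # ratio of interval widths
```

For well-behaved data of this size the width ratio is close to 1. A bootstrap interval that is far wider or narrower than the t interval is a reason to inspect the data and the resampling code.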