Turn in your bootstrap packet.
Explain what is wrong with each of the following statements.
Since the bootstrap distribution is built by repeatedly resampling from the original data, the two standard deviations are not expected to be approximately the same: the standard deviation of the bootstrap distribution estimates the standard error of the statistic (roughly s/sqrt(n) for a sample mean), not the standard deviation of the data. They could happen to come out similar for a particular set of random resamples, but that is not consistently expected.
While it is true that the bootstrap distribution is created by resampling from the original sample, this resampling is done with replacement. The permutation method, by contrast, resamples without replacement, and in the two-sample case it is the permutation method that first merges the data into one combined sample before reallocating the values to groups.
This has it backwards: a resample should be the same size as the original sample, drawn with replacement. Using a larger resample would understate the variability of a statistic computed from a sample of the actual size n. The bootstrap is used to approximate the sampling distribution of the statistic, not to rescue a sample that fails the Central Limit Theorem.
While it is true that the bootstrap distribution is created by resampling with replacement, the resampling is done from the sample, not from the population. We use samples in the first place because it is difficult or impossible to measure the entire population.
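As a quick illustration of the with/without-replacement distinction, here is a minimal sketch on a made-up vector x:
x<-c(3, 1, 4, 1, 5)
sample(x, length(x), replace=TRUE)   # bootstrap-style resample: values can repeat
sample(x, length(x), replace=FALSE)  # permutation-style reshuffle: same values, new order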
treatment<-c(57, 61)
control<-c(42, 62, 41, 28)
test_stat<-mean(treatment)-mean(control)
test_stat
## [1] 15.75
total<-c(57, 61, 42, 62, 41, 28)
sample(total, 2, replace=FALSE)
## [1] 61 57
treatmentB<-c(61, 62)
controlB<-c(57, 42, 41, 28)
test_statB<-mean(treatmentB)-mean(controlB)
treatment1<-c(41, 62)
control1<-c(57, 61, 42, 28)
test_stat1<-mean(treatment1)-mean(control1)
test_stat1
## [1] 4.5
treatment2<-c(57, 42)
control2<-c(62, 61, 41, 28)
test_stat2<-mean(treatment2)-mean(control2)
test_stat2
## [1] 1.5
treatment3<-c(42, 61)
control3<-c(62, 57, 41, 28)
test_stat3<-mean(treatment3)-mean(control3)
test_stat3
## [1] 4.5
treatment4<-c(42, 61)
control4<-c(62, 57, 41, 28)
test_stat4<-mean(treatment4)-mean(control4)
test_stat4
## [1] 4.5
treatment5<-c(62, 61)
control5<-c(42, 57, 41, 28)
test_stat5<-mean(treatment5)-mean(control5)
test_stat5
## [1] 19.5
treatment6<-c(42, 62)
control6<-c(61, 57, 41, 28)
test_stat6<-mean(treatment6)-mean(control6)
test_stat6
## [1] 5.25
treatment7<-c(61, 57)
control7<-c(28, 42, 62, 41)
test_stat7<-mean(treatment7)-mean(control7)
test_stat7
## [1] 15.75
treatment8<-c(57, 62)
control8<-c(41, 42, 28, 61)
test_stat8<-mean(treatment8)-mean(control8)
test_stat8
## [1] 16.5
treatment9<-c(57, 61)
control9<-c(62, 41, 42, 28)
test_stat9<-mean(treatment9)-mean(control9)
test_stat9
## [1] 15.75
treatment10<-c(62, 61)
control10<-c(41, 42, 57, 28)
test_stat10<-mean(treatment10)-mean(control10)
test_stat10
## [1] 19.5
treatment11<-c(62, 41)
control11<-c(28, 42, 57, 61)
test_stat11<-mean(treatment11)-mean(control11)
test_stat11
## [1] 4.5
treatment12<-c(42, 41)
control12<-c(62, 61, 57, 28)
test_stat12<-mean(treatment12)-mean(control12)
test_stat12
## [1] -10.5
treatment13<-c(28, 41)
control13<-c(42, 62, 61, 57)
test_stat13<-mean(treatment13)-mean(control13)
test_stat13
## [1] -21
treatment14<-c(62, 41)
control14<-c(61, 57, 28, 42)
test_stat14<-mean(treatment14)-mean(control14)
test_stat14
## [1] 4.5
treatment15<-c(62, 28)
control15<-c(61, 41, 42, 57)
test_stat15<-mean(treatment15)-mean(control15)
test_stat15
## [1] -5.25
treatment16<-c(57, 61)
control16<-c(62, 41, 42, 28)
test_stat16<-mean(treatment16)-mean(control16)
test_stat16
## [1] 15.75
treatment17<-c(57, 41)
control17<-c(62, 61, 28, 42)
test_stat17<-mean(treatment17)-mean(control17)
test_stat17
## [1] 0.75
treatment18<-c(57, 42)
control18<-c(62, 61, 41, 28)
test_stat18<-mean(treatment18)-mean(control18)
test_stat18
## [1] 1.5
treatment19<-c(41, 57)
control19<-c(62, 61, 42, 28)
test_stat19<-mean(treatment19)-mean(control19)
test_stat19
## [1] 0.75
treatment20<-c(62, 28)
control20<-c(61, 57, 41, 42)
test_stat20<-mean(treatment20)-mean(control20)
test_stat20
## [1] -5.25
TEST_STAT<-c(test_stat1, test_stat2, test_stat3, test_stat4, test_stat5, test_stat6, test_stat7, test_stat8, test_stat9, test_stat10, test_stat11, test_stat12, test_stat13, test_stat14, test_stat15, test_stat16, test_stat17, test_stat18, test_stat19, test_stat20)
hist(TEST_STAT)
All this was done by hand with the use of basic R functions. I had trouble setting up the permutation programmatically, even though this problem was supposed to be done by hand anyway; a loop-based version is sketched below. Also, I might be doing it wrong, but doesn't the (-) sign mean everything?
TEST_STAT>=15.75
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
6/20
## [1] 0.3
30% of the 20 statistic values were equal to or greater than the original value in part a).
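For reference, the by-hand permutations above could be automated with a short loop; this is a sketch reusing total and test_stat from above, with nperm and perm_stats as names introduced here:
nperm<-1000
perm_stats<-numeric(nperm)
for(i in 1:nperm){
  shuffled<-sample(total)                                 # shuffle all 6 values without replacement
  perm_stats[i]<-mean(shuffled[1:2])-mean(shuffled[3:6])  # first 2 as treatment, last 4 as control
}
hist(perm_stats)
mean(perm_stats>=test_stat)                               # approximate permutation p-value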
calls<-read.csv("calls80 (2).csv", header=TRUE, na.strings="?")
hist(calls$length)
The distribution of call lengths is strongly skewed to the right, with the center (median) falling between 0 and 500. There appear to be one or more high outliers between 2500 and 3000.
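A numeric summary (a sketch run on the same column) backs this up, since a mean well above the median indicates right skew:
summary(calls$length)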
bootStrapCI<-function(data, nsim){
  n<-length(data)
  bootCI<-numeric(nsim)                     # preallocate the vector of bootstrap means
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)  # resample indices with replacement
    bootCI[i]<-mean(data[bootSamp])         # mean of this bootstrap resample
  }
  return(bootCI)
}
callBootStrap<-bootStrapCI(calls$length, nsim=1000)
hist(callBootStrap)
For n=80 with a bootstrap that uses nsim=1000, the bootstrap distribution of the sample mean appears to be roughly Normal.
qqnorm(callBootStrap)
qqline(callBootStrap)
You can see on the QQ-plot that the points pull away from the line at both the beginning and the end of the data. This illustrates how the tails of the histogram depart from Normality.
calls2<-c(104, 102, 35, 211, 56, 325, 67, 9, 179, 59)
call2BootStrap<-bootStrapCI(calls2, nsim=1000)
hist(call2BootStrap)
qqnorm(call2BootStrap)
qqline(call2BootStrap)
This distribution has a similar structure to the one in part c): the tails again seem to depart from Normality, but it looks slightly closer to Normal, as the points sit a little closer to the line.
SE1<-sd(callBootStrap)   # bootstrap SE: the SD of the bootstrap means themselves
SE1
SE2<-sd(call2BootStrap)
SE2
The bootstrap standard error is the standard deviation of the bootstrap distribution of means; no further division by sqrt(n) is needed, because the sqrt(n) is already reflected in how tightly the bootstrap means cluster. In general, a smaller SRS gives a larger standard error, since the sqrt(n) in the denominator of s/sqrt(n) is smaller. A particular small sample can also happen to be more or less spread out than a larger one, but that is not a guaranteed feature of differing sample sizes.
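As a sanity check, the bootstrap SEs should come out close to the formula-based SEs s/sqrt(n) computed from each original sample (a sketch using the objects defined above):
sd(calls$length)/sqrt(80)  # formula SE for the n=80 sample
sd(calls2)/sqrt(10)        # formula SE for the n=10 sample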
trees<-read.csv("nspines (1).csv", header=TRUE, na.strings="?")
head(trees)
## ns dbh
## 1 n 27.8
## 2 n 14.5
## 3 n 39.1
## 4 n 3.2
## 5 n 58.8
## 6 n 55.5
boxplot(trees$dbh~trees$ns)  # side-by-side boxplots of dbh by region
As the side-by-side boxplots do not show the symmetric structure that might suggest roughly Normal distributions, it does not appear reasonable to use a standard t procedure. Boxplots also hide modality, so it is uncertain how many peaks each distribution has and whether that would further undercut a Normality assumption.
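Per-group QQ-plots give a more direct check; this is a sketch assuming, as in the next chunk, that rows 1-30 are one region and rows 31-60 the other:
qqnorm(trees$dbh[1:30])   # first region
qqline(trees$dbh[1:30])
qqnorm(trees$dbh[31:60])  # second region
qqline(trees$dbh[31:60])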
obs_stat<-mean(trees$dbh[1:30])-mean(trees$dbh[31:60])
obs_stat
## [1] -10.83333
bootStrapCI2<-function(data1, data2, nsim){
  n1<-length(data1)
  n2<-length(data2)
  bootCI2<-numeric(nsim)                       # preallocate the bootstrap differences
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)  # resample indices within sample 1
    bootSamp2<-sample(1:n2, n2, replace=TRUE)  # resample indices within sample 2 (1:n2, not n1:n2)
    bootCI2[i]<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
  }
  return(bootCI2)
}
bootStrapTrees<-bootStrapCI2(trees$dbh[1:30], trees$dbh[31:60], nsim=10000)
hist(bootStrapTrees)
# Quantile Method
quantile(bootStrapTrees, c(.025, .975))
## 2.5% 97.5%
## -18.353417 -2.733333
#Hybrid Method
se<-sd(bootStrapTrees)
obs_stat+c(-1,1)*qt(.975, df=29)*se
## [1] -19.027677 -2.638989
Based on each sample (North and South) having n=30 and the histogram of bootStrapTrees appearing to be roughly Normal, the Hybrid Method seems to be a reliable way to create an interval based on the middle 95%.
seT<-sqrt(sd(trees$dbh[1:30])^2/30+sd(trees$dbh[31:60])^2/30)  # sqrt(s1^2/n1 + s2^2/n2)
obs_stat+c(-1,1)*qt(.975, df=29)*seT
All three intervals are entirely negative, and with the standard error computed as sqrt(s1^2/n1 + s2^2/n2), the usual two-sample t interval should be comparable in width to the Quantile and Hybrid intervals. I would still use either the Quantile or the Hybrid Method: the boxplots cast doubt on the Normality assumption behind the t procedure, and the bootstrap intervals are built from 10000 resamples rather than relying on that assumption.
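For comparison, R's built-in t.test() gives a Welch two-sample interval (it chooses its own df rather than the df=29 used above); a sketch:
t.test(trees$dbh[1:30], trees$dbh[31:60])$conf.int  # Welch 95% CI for the difference in means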