Problem 2

a) It is the standard error of the original sample (not its standard deviation) that the standard deviation of the bootstrap distribution will approximately match.

b) The bootstrap distribution IS created WITH replacement from the original sample.

c) False; when resampling, it is best to use samples of the same size as the original sample.

d) False, the bootstrap distribution is created by resampling with replacement from the SAMPLE, not from the population.
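
As a quick illustration of (b) through (d), a bootstrap resample is drawn with replacement from the sample itself and has the same size as the sample (a toy sketch with made-up numbers):

samp <- c(3, 7, 8, 12, 15)                  # a made-up sample
sample(samp, length(samp), replace = TRUE)  # one bootstrap resample: same size, values can repeat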

Problem 3

a) Treatment group mean: (57 + 61)/2 = 118/2 = 59. Control group mean: (42 + 62 + 41 + 28)/4 = 173/4 = 43.25. Difference in means: 59 - 43.25 = 15.75.
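
The same arithmetic can be checked in R:

mean(c(57, 61)) - mean(c(42, 62, 41, 28))   # 59 - 43.25 = 15.75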

b)

SRS<-c(57, 61, 42, 62, 41, 28)
sample(SRS, 6, replace = FALSE)
## [1] 61 41 28 62 42 57

Treatment group (first two values of the shuffle): 61, 41. Mean = 51

Control group (remaining four values): 28, 62, 42, 57. Mean = 47.25

Difference in means = 51 - 47.25 = 3.75
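
The same calculation in R, using the shuffle printed above (treating the first two shuffled values as the treatment group is an assumption about how the groups were assigned):

shuffled <- c(61, 41, 28, 62, 42, 57)       # the permutation shown above
mean(shuffled[1:2]) - mean(shuffled[3:6])   # 51 - 47.25 = 3.75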

c)

SRS<-c(57, 61, 42, 62, 41, 28)
replicate(20, (sample(SRS, 6, replace = FALSE)))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]   28   62   57   57   61   61   57   28   28    41    57    61    28
## [2,]   41   42   62   28   42   41   41   61   62    62    61    57    61
## [3,]   42   41   61   61   62   42   62   42   42    57    28    28    42
## [4,]   62   57   42   42   41   57   42   62   41    42    41    62    62
## [5,]   57   28   28   41   57   28   28   41   57    61    42    42    57
## [6,]   61   61   41   62   28   62   61   57   61    28    62    41    41
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]    42    61    28    57    62    61    41
## [2,]    28    62    57    62    61    42    62
## [3,]    61    41    61    41    42    57    42
## [4,]    57    42    41    61    57    28    61
## [5,]    41    28    62    42    41    41    28
## [6,]    62    57    42    28    28    62    57
hist(c(49-48.25, 51.5-47, 35-55.25, 44.5-50.5, 51.5-47, 35-55.25, 42.5-51.5, 49-48.25, 35-55.25, 49-48.25, 59-43.25, 52-46.75, 45-50.25, 59-43.25, 34.5-55.5, 49.5-47.75, 51-47.25, 51.5-47, 52-46.75, 35-55.25))
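
For reference, the 20 differences could also be computed directly from the replicate() matrix instead of by hand (a sketch; it assumes the first two values of each shuffled column form the treatment group, and each run produces a new set of shuffles):

perms <- replicate(20, sample(SRS, 6, replace = FALSE))            # one shuffle per column
diffs <- apply(perms, 2, function(s) mean(s[1:2]) - mean(s[3:6]))  # treatment mean minus control mean
hist(diffs)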

d)

x<-c(49-48.25, 51.5-47, 35-55.25, 44.5-50.5, 51.5-47, 35-55.25, 42.5-51.5, 49-48.25, 35-55.25, 49-48.25, 59-43.25, 52-46.75, 45-50.25, 59-43.25, 34.5-55.5, 49.5-47.75, 51-47.25, 51.5-47, 52-46.75, 35-55.25)
x[x>=15.75]
## [1] 15.75 15.75

10% (2 of the 20 simulated differences) were greater than or equal to 15.75.
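
Equivalently, the proportion can be computed directly:

mean(x >= 15.75)   # 2 of the 20 simulated differences, so 0.1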

e)

x<-c(49-48.25, 51.5-47, 35-55.25, 44.5-50.5, 51.5-47, 35-55.25, 42.5-51.5, 49-48.25, 35-55.25, 49-48.25, 59-43.25, 52-46.75, 45-50.25, 59-43.25, 34.5-55.5, 49.5-47.75, 51-47.25, 51.5-47, 52-46.75, 35-55.25)
x[x>=15.75]
## [1] 15.75 15.75
2/15
## [1] 0.1333333

The exact p-value is 0.13. This is about 0.03 off of my estimate of 0.10.

Problem 4

Part 1

a)

library(readr)
calls80 <- read_csv("calls80.csv")
## Parsed with column specification:
## cols(
##   length = col_double()
## )
calls<-(calls80$length)
hist(calls)

#### The distribution is skewed to the right.

b)

bootstrapCalls <- function(data, nsim){
  # returns nsim bootstrap means of the vector data
  n <- length(data)
  bootMeans <- numeric(nsim)                    # preallocate storage for the bootstrap means
  
  for (i in 1:nsim){
    bootSamp <- sample(1:n, n, replace = TRUE)  # resample indices with replacement
    bootMeans[i] <- mean(data[bootSamp])        # mean of this bootstrap resample
  }
  return(bootMeans)
}
bootCalls <- bootstrapCalls(calls, 1000)   # bootstrap distribution of the mean call length
hist(bootCalls)
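
An equivalent, more compact way to draw the bootstrap means, shown here only as an alternative sketch of the same idea:

bootCallsAlt <- replicate(1000, mean(sample(calls, length(calls), replace = TRUE)))
hist(bootCallsAlt)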

c)

qqnorm(bootCalls)   # normal quantile plot of the same bootstrap distribution

#### Assessing the normal quantile plot, the tails appear to contain values more extreme than would be expected under a normal distribution.

Part 2

d)

callsSRS <- c(104, 102, 35, 211, 56, 325, 67, 9, 179, 59)
bootCallsSRS <- bootstrapCalls(callsSRS, 1000)   # reuse bootstrapCalls() defined in part b)
hist(bootCallsSRS)

qqnorm(bootCallsSRS)   # normal quantile plot of the same bootstrap distribution

#### Assessing the normal quantile plot, this bootstrap distribution appears even farther from normal.

e)

sd(bootstrapCalls(calls,10000))
## [1] 37.615
sd(bootstrapCalls(callsSRS,10000))
## [1] 29.16442

At first this looks backwards: the bootstrap standard error for the SRS of 10 calls is smaller, not larger, even though we would expect a sample of 10 to give a larger standard error than a sample of 80. The bootstrap standard error, however, reflects the spread of the particular sample in hand (it is roughly s/sqrt(n)), and this SRS happens to have a much smaller sample standard deviation (about 97) than the full sample of 80 (whose bootstrap SE of about 37.6 implies a sample standard deviation of roughly 37.6 x sqrt(80), about 336). So the smaller bootstrap standard error is not a coding error; it simply reflects the smaller spread of this particular SRS, which apparently did not capture the longest calls.
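
As a rough check, the formula-based standard errors s/sqrt(n) for each sample should be close to the corresponding bootstrap standard errors (a sketch using the calls and callsSRS vectors defined above):

sd(calls) / sqrt(length(calls))         # formula-based SE for the full sample of 80
sd(callsSRS) / sqrt(length(callsSRS))   # formula-based SE for the SRS of 10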

Problem 5

library(readr)
nspines <- read_csv("/Volumes/raenlow/MATH239/nspines.csv")
## Parsed with column specification:
## cols(
##   ns = col_character(),
##   dbh = col_double()
## )
a)

Npine <- nspines[1:30, ]    # first 30 rows (assumed to be the northern pines)
Spine <- nspines[31:60, ]   # last 30 rows (assumed to be the southern pines)
library(tidyverse)
ggplot(nspines, aes(dbh))+
  geom_histogram(bins=7)+
  facet_wrap(~ns)

ggplot(nspines, aes(y=dbh, x=ns, fill=ns))+
  geom_boxplot()

#### Neither distribution appears approximately normal, and the sample sizes of n = 30 are only just large enough by the usual n >= 30 guideline. Based on these observations, it may not be reasonable to use standard t procedures.
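
For reference, the per-group sample sizes, means, and standard deviations can also be read straight off the ns column (a sketch using dplyr, which tidyverse already loads):

nspines %>%
  group_by(ns) %>%
  summarise(n = n(), mean_dbh = mean(dbh), sd_dbh = sd(dbh))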

b)

meanspine=mean(Spine$dbh)
meannpine=mean(Npine$dbh) 
meannpine-meanspine
## [1] -10.83333

c)

bootStrapCI2 <- function(data1, data2, nsim){
  # returns nsim bootstrap replicates of mean(data1) - mean(data2)
  n1 <- length(data1)
  n2 <- length(data2)
  
  bootDiffs <- numeric(nsim)                        # preallocate storage
  
  for(i in 1:nsim){
    bootSamp1 <- sample(1:n1, n1, replace = TRUE)   # resample each group independently
    bootSamp2 <- sample(1:n2, n2, replace = TRUE)
    bootDiffs[i] <- mean(data1[bootSamp1]) - mean(data2[bootSamp2])
  }
  
  return(bootDiffs)
}
pinebootCI=bootStrapCI2(Npine$dbh, Spine$dbh, 1000)
hist(pinebootCI)

d)

Quantile method: the 95% interval runs from the 2.5th to the 97.5th percentile of the bootstrap distribution.

quantile(pinebootCI, c(0.025, 0.975))
##      2.5%     97.5% 
## -18.57533  -2.77775

Hybrid method: the point estimate plus or minus t* times the bootstrap standard error, with t* from a t distribution with n1 + n2 - 2 = 58 degrees of freedom.

bootSE=sd(pinebootCI)
(mean(Npine$dbh)-mean(Spine$dbh))+c(-1,1)*qt(0.975, df = 58)*bootSE
## [1] -18.749693  -2.916974

e)

mean(pinebootCI)-(meannpine-meanspine)
## [1] -0.1196967

The bias of about -0.12 seems relatively small, and the bootstrap distribution does appear approximately normal. The conditions for the hybrid method appear to be met, so I believe this interval is reliable.

f)

t.test(Npine$dbh, Spine$dbh)

Our usual two-sample t confidence interval (from t.test above) is centered at the observed difference in means of about -10.8, whereas our bootstrap confidence intervals were (-18.58, -2.78) from the quantile method and (-18.75, -2.92) from the hybrid method, each spanning a range of about 16 and each containing the observed difference. Because the original samples are skewed and the bootstrap distribution is closer to normal, I would rely on the bootstrap intervals.