Problem 1

Turn in your bootstrap packet.

Problem 2

Explain what is wrong with each of the following statements.

  a. The standard deviation of the bootstrap distribution will be approximately the same as the standard deviation of the original sample.

The standard deviation of the bootstrap distribution approximates the standard error of the statistic, not the standard deviation of the original sample. For the sample mean, the bootstrap standard deviation is roughly s/sqrt(n), so it is smaller than the sample standard deviation by a factor of about sqrt(n), and the two are not expected to be approximately the same.
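
As a quick illustration, here is a minimal sketch (reusing the six scores from Problem 3; bootMeans is a name introduced here) comparing the two quantities:
x<-c(57, 61, 42, 62, 41, 28)
bootMeans<-replicate(2000, mean(sample(x, replace=TRUE)))   # 2000 bootstrap means
sd(x)          # standard deviation of the original sample
sd(bootMeans)  # far smaller: roughly sd(x)/sqrt(6)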

  b. The bootstrap distribution is created by resampling without replacement from the original sample.

While it is true that the bootstrap distribution is created by resampling from the original sample, this process is done with replacement. Resampling without replacement is the permutation method. (For a two-sample permutation test, the two samples must first be merged into one combined sample before the groups are reshuffled.)
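
The distinction is easy to see in R (a minimal sketch with the same six scores):
x<-c(57, 61, 42, 62, 41, 28)
sample(x, replace=TRUE)   # bootstrap resample: with replacement, so duplicates can appear
sample(x)                 # permutation: without replacement, simply a reordering of x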

  c. When generating the resamples, it is best to use a sample size smaller than the size of the original sample.

When generating a resample, you should use the same sample size as the original sample, not a smaller (or larger) one. The variability of a statistic depends on the sample size (for the mean, the standard error is roughly s/sqrt(n)), so resamples of a different size would produce a bootstrap distribution with the wrong spread and misstate the sampling variability of the original sample.
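
A toy demonstration of why the resample size must match n (hypothetical data; rnorm stands in for a real sample):
x<-rnorm(50)                                               # pretend original sample, n = 50
full<-replicate(2000, mean(sample(x, 50, replace=TRUE)))   # resamples of size n
half<-replicate(2000, mean(sample(x, 25, replace=TRUE)))   # resamples of size n/2
sd(full)   # roughly sd(x)/sqrt(50)
sd(half)   # noticeably larger, roughly sd(x)/sqrt(25)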

  d. The bootstrap distribution is created by resampling with replacement from the population.

While it is true that the bootstrap distribution is created by resampling with replacement, the resampling is done from the sample, not the population. We take samples in the first place because it is difficult or impossible to measure the entire population, so the population is not available to resample from.

Problem 3

  a. Calculate the difference in means x̄_treatment − x̄_control between the two groups. This is the observed value of the statistic.
treatment<-c(57, 61)
control<-c(42, 62, 41, 28)
test_stat<-mean(treatment)-mean(control)
test_stat
## [1] 15.75
  b. Resample: Start with the six scores and choose an SRS of two scores to form the treatment group for the first resample.
total<-c(57, 61, 42, 62, 41, 28)
sample(total, 2, replace=FALSE)
## [1] 61 57
treatmentB<-c(61, 57)        # the two scores drawn by the SRS above
controlB<-c(42, 62, 41, 28)  # the remaining four scores
test_statB<-mean(treatmentB)-mean(controlB)
test_statB
## [1] 15.75
  c. Repeat part (b) 20 times to get 20 resamples and 20 values of the statistic. Make a histogram of the distribution of these 20 values.
treatment1<-c(41, 62)
control1<-c(57, 61, 42, 28)
test_stat1<-mean(treatment1)-mean(control1)
test_stat1
## [1] 4.5
treatment2<-c(57, 42)
control2<-c(62, 61, 41, 28)
test_stat2<-mean(treatment2)-mean(control2)
test_stat2
## [1] 1.5
treatment3<-c(42, 61)
control3<-c(62, 57, 41, 28)
test_stat3<-mean(treatment3)-mean(control3)
test_stat3
## [1] 4.5
treatment4<-c(42, 61)
control4<-c(62, 57, 41, 28)
test_stat4<-mean(treatment4)-mean(control4)
test_stat4
## [1] 4.5
treatment5<-c(62, 61)
control5<-c(42, 57, 41, 28)
test_stat5<-mean(treatment5)-mean(control5)
test_stat5
## [1] 19.5
treatment6<-c(42, 62)
control6<-c(61, 57, 41, 28)
test_stat6<-mean(treatment6)-mean(control6)
test_stat6
## [1] 5.25
treatment7<-c(61, 57)
control7<-c(28, 42, 62, 41)
test_stat7<-mean(treatment7)-mean(control7)
test_stat7
## [1] 15.75
treatment8<-c(57, 62)
control8<-c(41, 42, 28, 61)
test_stat8<-mean(treatment8)-mean(control8)
test_stat8
## [1] 16.5
treatment9<-c(57, 61)
control9<-c(62, 41, 42, 28)
test_stat9<-mean(treatment9)-mean(control9)
test_stat9
## [1] 15.75
treatment10<-c(62, 61)
control10<-c(41, 42, 57, 28)
test_stat10<-mean(treatment10)-mean(control10)
test_stat10
## [1] 19.5
treatment11<-c(62, 41)
control11<-c(28, 42, 57, 61)
test_stat11<-mean(treatment11)-mean(control11)
test_stat11
## [1] 4.5
treatment12<-c(42, 41)
control12<-c(62, 61, 57, 28)
test_stat12<-mean(treatment12)-mean(control12)
test_stat12
## [1] -10.5
treatment13<-c(28, 41)
control13<-c(42, 62, 61, 57)
test_stat13<-mean(treatment13)-mean(control13)
test_stat13
## [1] -21
treatment14<-c(62, 41)
control14<-c(61, 57, 28, 42)
test_stat14<-mean(treatment14)-mean(control14)
test_stat14
## [1] 4.5
treatment15<-c(62, 28)
control15<-c(61, 41, 42, 57)
test_stat15<-mean(treatment15)-mean(control15)
test_stat15
## [1] -5.25
treatment16<-c(57, 61)
control16<-c(62, 41, 42, 28)
test_stat16<-mean(treatment16)-mean(control16)
test_stat16
## [1] 15.75
treatment17<-c(57, 41)
control17<-c(62, 61, 28, 42)
test_stat17<-mean(treatment17)-mean(control17)
test_stat17
## [1] 0.75
treatment18<-c(57, 42)
control18<-c(62, 61, 41, 28)
test_stat18<-mean(treatment18)-mean(control18)
test_stat18
## [1] 1.5
treatment19<-c(41, 57)
control19<-c(62, 61, 42, 28)
test_stat19<-mean(treatment19)-mean(control19)
test_stat19
## [1] 0.75
treatment20<-c(62, 28)
control20<-c(61, 57, 41, 42)
test_stat20<-mean(treatment20)-mean(control20)
test_stat20
## [1] -5.25
TEST_STAT<-c(test_stat1, test_stat2, test_stat3, test_stat4, test_stat5, test_stat6, test_stat7, test_stat8, test_stat9, test_stat10, test_stat11, test_stat12, test_stat13, test_stat14, test_stat15, test_stat16, test_stat17, test_stat18, test_stat19, test_stat20)
hist(TEST_STAT)

All of this was done by hand with basic R functions; a short loop could automate the resampling, as shown in the sketch below. Note that negative indexing in R, e.g. total[-idx], selects every element except those at positions idx, which makes forming the control group easy.
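
A sketch of such a loop, using only base R (perm_stats and treat_idx are names introduced here):
total<-c(57, 61, 42, 62, 41, 28)
perm_stats<-numeric(20)                      # holds the 20 statistic values
for(i in 1:20){
  treat_idx<-sample(1:6, 2, replace=FALSE)   # SRS of 2 positions for the treatment group
  perm_stats[i]<-mean(total[treat_idx])-mean(total[-treat_idx])   # negative index drops the treatment positions
}
hist(perm_stats)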

  d. What proportion of the 20 statistic values were equal to or greater than the original value in part (a)?
TEST_STAT>=15.75
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
6/20
## [1] 0.3

30% (6 of 20) of the statistic values were equal to or greater than the original value in part (a).

  e. For this small data set, there are only 15 (6 choose 2) possible permutations of the data. As a result, we can calculate the exact p-value by counting the number of permutations with a statistic value greater than or equal to the original value and then dividing by 15. What is the exact p-value here? How close was your estimate?
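
Since there are only choose(6, 2) = 15 possible treatment pairs, the statistic can be computed for every one of them (a sketch using base R's combn; all_stats is a name introduced here):
total<-c(57, 61, 42, 62, 41, 28)
all_stats<-combn(total, 2, function(tr) mean(tr)-(sum(total)-sum(tr))/4)   # statistic for each of the 15 pairs
mean(all_stats>=test_stat)   # proportion >= 15.75, the exact p-value
## [1] 0.2

Three of the 15 treatment pairs ({57, 61}, {57, 62}, and {61, 62}) give a statistic of at least 15.75, so the exact p-value is 3/15 = 0.2. The estimate of 0.3 from the 20 resamples in part (d) was reasonably close.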

Problem 4

  a. Make a histogram of the call lengths. Describe the shape of the distribution.
calls<-read.csv("calls80 (2).csv", header=TRUE, na.strings="?")
hist(calls$length)

The distribution of call lengths is strongly right-skewed, with the center (median) between 0 and 500. There appear to be one or more high outliers between 2500 and 3000.

  b. The central limit theorem says that the sampling distribution of the sample mean x̄ becomes Normal as the sample size increases. Is the sampling distribution roughly Normal for n=80? To find out, bootstrap these data using 1000 resamples and inspect the bootstrap distribution of the mean.
bootStrapCI<-function(data, nsim){
  # Draw nsim bootstrap resamples (with replacement, each the same size as
  # the original data) and return the vector of resampled means.
  n<-length(data)
  bootMeans<-numeric(nsim)   # preallocate rather than growing inside the loop
  for(i in 1:nsim){
    bootSamp<-sample(1:n, n, replace=TRUE)   # indices for one resample
    bootMeans[i]<-mean(data[bootSamp])
  }
  return(bootMeans)
}

callBootStrap<-bootStrapCI(calls$length, nsim=1000)
hist(callBootStrap)

For n=80 with nsim=1000 resamples, the bootstrap distribution of the mean appears roughly Normal, which suggests the sampling distribution of x̄ is roughly Normal as well.

  c. The central part of the distribution is close to Normal. In what way do the tails depart from Normality?
qqnorm(callBootStrap)
qqline(callBootStrap)

On the QQ-plot, the points drift away from the line at both ends of the data. This shows how the tails of the distribution depart from Normality.

  d. Create and inspect the bootstrap distribution of the sample mean for these data using 1000 resamples. Compared with your distribution from the previous part, is this distribution closer to or farther away from Normal?
calls2<-c(104, 102, 35, 211, 56, 325, 67, 9, 179, 59)
call2BootStrap<-bootStrapCI(calls2, nsim=1000)
hist(call2BootStrap)

qqnorm(call2BootStrap)
qqline(call2BootStrap)

This distribution shows a pattern similar to part (c), with the tails departing from Normality, but it appears slightly closer to Normal since the points sit a bit nearer the line.

  e. Compare the bootstrap standard errors for your two sets of resamples (the one in PART I and the one from PART II). Why is the standard error larger for the smaller SRS?
# The bootstrap standard error is the standard deviation of the bootstrap
# distribution itself; no further division by sqrt(n) is needed.
SE1<-sd(callBootStrap)
SE1
## [1] 37.39908
SE2<-sd(call2BootStrap)
SE2
## [1] 29.46192

The standard error of the sample mean is about s/sqrt(n), so a smaller n gives a smaller denominator and therefore a larger standard error: means of small samples vary more from resample to resample. (For these particular samples, the n = 10 subset happens to have a much smaller sample standard deviation than the n = 80 sample, which offsets that effect; for samples drawn from the same population, though, the smaller SRS will typically have the larger standard error.)
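
As a sanity check, each bootstrap standard error should be close to the plug-in estimate s/sqrt(n) computed from the corresponding sample (a sketch; values vary from run to run):
sd(calls$length)/sqrt(80)   # formula SE for the n = 80 sample; should be near SE1
sd(calls2)/sqrt(10)         # formula SE for the n = 10 sample; should be near SE2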

Problem 5

  a. Use a side-by-side boxplot or faceted histograms to examine the data graphically (splitting by region). Does it appear reasonable to use standard t procedures?
trees<-read.csv("nspines (1).csv", header=TRUE, na.strings="?")
head(trees)
##   ns  dbh
## 1  n 27.8
## 2  n 14.5
## 3  n 39.1
## 4  n  3.2
## 5  n 58.8
## 6  n 55.5
boxplot(trees$dbh~trees$ns)   # side-by-side boxplots of diameter by region

The side-by-side boxplots do not show the symmetric shape that would suggest roughly Normal distributions, so it does not appear reasonable to use standard t procedures. Boxplots also hide modality, so it is uncertain how many peaks each distribution has and whether that would further undermine any Normality claims.

  b. Calculate our observed statistic x̄_North − x̄_South.
obs_stat<-mean(trees$dbh[1:30])-mean(trees$dbh[31:60])   # rows 1-30 are the northern trees, rows 31-60 the southern
obs_stat
## [1] -10.83333
  c. Bootstrap the difference in means x̄_North − x̄_South (at least 1000 times) and look at the bootstrap distribution.
bootStrapCI2<-function(data1, data2, nsim){
  # Bootstrap the difference in means: resample each group separately,
  # with replacement, each at its own sample size.
  n1<-length(data1)
  n2<-length(data2)
  bootDiffs<-numeric(nsim)   # preallocate
  for(i in 1:nsim){
    bootSamp1<-sample(1:n1, n1, replace=TRUE)   # indices into data1
    bootSamp2<-sample(1:n2, n2, replace=TRUE)   # indices into data2
    bootDiffs[i]<-mean(data1[bootSamp1])-mean(data2[bootSamp2])
  }
  return(bootDiffs)
}

bootStrapTrees<-bootStrapCI2(trees$dbh[1:30], trees$dbh[31:60], nsim=10000)
hist(bootStrapTrees)

  d. Calculate both types of confidence intervals (quantile and hybrid).
# Quantile Method
quantile(bootStrapTrees, c(.025, .975))
##       2.5%      97.5% 
## -18.353417  -2.733333
# Hybrid Method: obs_stat +/- t* times the bootstrap standard error
se<-sd(bootStrapTrees)   # bootstrap standard error of the difference in means
obs_stat+c(-1,1)*qt(.975, df=29)*se
## [1] -19.027677  -2.638989
  e. Comment on whether the conditions for the hybrid method (“bootstrap t-confidence interval”) are met. Do you believe this interval would be reliable?

Based on each sample (North and South) having n=30 and the histogram of bootStrapTrees appearing roughly Normal, the hybrid method seems to be a reliable way to create an interval based on the middle 95%.

  f. Compare the bootstrap results with the usual two-sample t confidence interval. How do the intervals differ? Which would you use?
# Standard error for the usual two-sample t interval: sqrt(s1^2/n1 + s2^2/n2)
seT<-sqrt(sd(trees$dbh[1:30])^2/30+sd(trees$dbh[31:60])^2/30)
obs_stat+c(-1,1)*qt(.975, df=29)*seT

Because the bootstrap standard error estimates the same quantity sqrt(s1^2/n1 + s2^2/n2), the usual two-sample t interval should be close in width to the hybrid interval, and all three intervals lie entirely below zero. I would use one of the bootstrap intervals (quantile or hybrid): the boxplots in part (a) cast doubt on the Normality assumption behind the standard t procedure, while the bootstrap intervals do not rely on it.